

IASSIST

QUARTERLY

VOLUME 17    Fall/Winter 1993    NUMBER 3&4





Printed  at  UCLA 


IASSIST

QUARTERLY 


The  IASSIST  QUARTERLY  represents  an  international  cooperative 
effort  on  the  part  of  individuals  managing,  operating,  or  using 
machine-readable  data  archives,  data  libraries,  and  data  services.  The 
QUARTERLY  reports  on  activities  related  to  the  production, 
acquisition,  preservation,  processing,  distribution,  and  use  of  machine- 
readable  data  carried  out  by  its  members  and  others  in  the  international 
social  science  community.   Your  contributions  and  suggestions  for 
topics  of  interest  are  welcomed.  The  views  set  forth  by  authors  of 
articles contained in this publication are not necessarily those of
IASSIST.

Information  for  Authors 

The  QUARTERLY  is  published  four  times  per  year.  Articles  and 
other  information  should  be  typewritten  and  double-spaced.  Each  page 
of  the  manuscript  should  be  numbered.  The  first  page  should  contain 
the article title, author's name, affiliation, address to which correspondence
may be sent, and telephone number. Footnotes and bibliographic
citations should be consistent in style, preferably following a
standard authority such as the University of Chicago Press Manual of
Style or Kate L. Turabian's Manual for Writers. Where appropriate,
machine-readable data files should be cited with bibliographic citations
consistent in style with Dodd, Sue A. "Bibliographic references for
numeric social science data files: suggested guidelines". Journal of the
American Society for Information Science 30(2):77-82, March 1979.
If  the  contribution  is  an  announcement  of  a  conference,  training 
session,  or  the  like,  the  text  should  include  a  mailing  address  and  a 
telephone  number  for  the  director  of  the  event  or  for  the  organization 
sponsoring  the  event.   Book  notices  and  reviews  should  not  exceed  two 
double-spaced  pages.   Deadlines  for  submitting  articles  are  six  weeks 
before  publication.   Manuscripts  should  be  sent  in  duplicate  to  the 
Editor: Walter Piovesan, Research Data Library, W.A.C. Bennett
Library, Simon Fraser University, Burnaby, B.C., V5A 1S6
CANADA. (604) 291-4349 E-Mail: USERDLIB@SFU.BITNET
Book reviews should be submitted in duplicate to the Book Review
Editor: Daniel Tsang, Main Library, University of California P.O. Box
19557, Irvine, California 92713 USA. (714) 856-4978 E-Mail:
DTSANG@ORION.CF.UCI.EDU

Title:  Newsletter  -  International  Association  for 
Social  Science  Information  Service  and 
Technology 

ISSN - United States: 0739-1137 Copyright 1985 by
IASSIST. All rights reserved


CONTENTS 


Volume  17       Number  3/4 


Fall/Winter  1993 


FEATURES


Setting  Up  a  National  On-line  Census  Data 
Service 

by  Virginia  Knight 

Problems  of  Harmonising  UK-Wide  Samples 
of  Anonymised  Records  from  the  1991 
Census 

by  Elizabeth  Middleton 

Partitioned  Distributed  Computing 

by  George  Yates 

Integration  of  "Traditional"  Data  Centre 
Services  With  GIS  Technology 

by  Jeffery  Moon 

BatchCensus  /  GetCensus:  1990  U.S.  Census, 
Electronic  Data  Distribution  Software. 

by  Fred  Nick 

The  Need  for  Revised  Data  Documentation 
Standards:  New  Solutions  for  Old 
Problems. 

by  Laura  A.  Guy 

Starting  A  Data  Library  Service. 

by  Gaetan  Drolet 


IASSIST 95 Conference
IASSIST Officers


Setting  Up  a  National  On-line  Census  Data  Service 


by Virginia Knight ¹
Census  Dissemination  Unit 
Manchester  Computing  Centre 
University  of  Manchester 


The ESRC, together with the Information Systems
Committee of the Universities Funding Council, has set
up a Census Initiative at a cost of some £3.1 million to
make information from the 1991 Census available to
academics for teaching and research. The Initiative has
purchased data from the Census, set up units to support
it, and funded programmes of research, training and
development. As part of the Initiative, datasets derived
from the Census are being held at Manchester Computing
Centre and the Census Dissemination Unit has been
set up from 1992 to 1997 to support them. Manchester
Computing  Centre,  as  well  as  supporting  computing  at 
Manchester  University  and  UMIST,  acts  as  a  national 
computing  centre  for  the  dissemination  and  support  of  a 
number  of  large  datasets,  such  as  the  General  Household 
Survey  and  the  Family  Expenditure  Survey.  It  was  seen 
as  an  appropriate  place  to  hold  the  1991  Census  datasets 
since  they  are  large  and  complex  and  require  specialist 
support. 

A similar initiative existed for the 1981 Census. In this
case, the data was distributed from regional computing
centres at London, Bath, Edinburgh, Manchester, Aberdeen
and Newcastle; this distributed scheme reflects the
early state of network development in the early 1980s. The
data was held at the ESRC Data Archive, which, however,
did not provide on-line access, although it had a
national support post, whose holder ran courses, produced
documentation, supplied data to regional centres and
gave advice. A software package, SASPAC, which could
extract and manipulate the data, was developed by a team
at Durham and Edinburgh Universities. The
data was held on the Manchester mainframe and Manchester
Computing Centre produced a two-volume
manual which described the SASPAC package, and
explained how to run programs using the data from
Manchester. It was apparent, however, that the support
was required at the point of access and supply. The
Computer Board (the predecessor of the ISC-UFC)
funded an additional support post at Manchester to help
users to access the data on-line and to develop the on-line
service, for example by implementing specialist software
for the Special Workplace and Migration Statistics, and
to produce documentation and provide research support.
The support post proved successful enough to establish
Manchester as a national datasets centre with considerable
expertise in handling and analysing Census data.


For  1991,  it  was  decided  to  combine  the  support  posts 
which  had  existed  at  the  Data  Archive  and  at  Manchester 
in  a  single  unit. 

The datasets supplied by the Census Offices under the
Initiative comprise the following: the Small Area and
Local Base Statistics, which are sets of tables covering
all aspects of the Census; the Special Migration and
Special Workplace Statistics, which are matrices dealing
with migration and journey to work; a postcode/ED
directory; and digitised boundary data (compiled by the
EDLINE consortium) corresponding to various geographical
zones for which Census data is being released.
Although the sample of microdata from the 1991 Census
is also being held at Manchester, it is being handled and
supported by the Census Microdata Unit, which is
entirely separate, though working in close collaboration
with us. CHEST has also purchased the 1991 version of
SASPAC, developed by the London Research Centre.

The timetable originally planned for the release of the
datasets was a long way from what actually happened!
Because of a misunderstanding in the coding of the term-time
address of students, there was a six-month delay in
processing them at the Census Offices. The Small Area
and  Local  Base  Statistics  should  have  been  released  from 
the  beginning  of  1992,  so  when  I  began  in  my  present 
post  in  June  of  that  year,  we  were  expecting  them  to 
arrive  within  a  few  weeks.  In  July  we  received  data  for 
three  counties,  but  on  inspection  we  found  that  some  of 
the  tables  were  incomplete;  for  example,  there  appeared 
to  be  no  people  over  75!  It  turned  out  that  the  program 
which  had  written  the  raw  data  was  defective,  and  OPCS 
had  to  produce  a  corrected  version  of  it.  The  data 
produced  by  this  started  to  come  out  at  the  end  of  August 
and  arrived  during  the  autumn  at  the  rate  of  about  two 
counties  a  week. 

The  Small  Area  and  Local  Base  Statistics  came  out  in 
two  waves,  because  the  tables  are  of  two  kinds;  most 
tables  cover  the  entire  population,  but  some  were  only 
produced  for  a  10%  sample.  The  100%  tables  came  out 
first,  and  were  more  or  less  complete  at  the  beginning  of 
this  year;  the  10%  tables  started  to  come  out  in  January 
1993,  and  we  finished  loading  them  at  the  end  of  April. 

The 100% or 10% data for a county would typically
arrive on five magnetic tapes, making a total of over 700
tapes. After our tape librarian had allocated them a place
in the tape library, I would copy the tapes on to a minidisk.
For England and Wales, the county and district
level  files  were  combined,  and  had  to  be  split  up.  These 
would  be  raw  data  files,  which  would  then  have  to  be 
reloaded  as  SASPAC  system  files.  This  process  could 
normally  be  completed  within  24  hours.  Inevitably,  it 
would  be  held  up  if  several  counties  arrived  at  once,  and 
to  avoid  long  delays,  we  devised  a  way  of  loading  the 
data  as  system  files  offline.  Thus  I  could  be  copying  one 
lot  of  raw  data  files  off  the  tapes  while  in  the  background 
another  county  was  being  converted  into  system  files. 

It  was  not  all  plain  sailing,  however.  A  number  of  errors 
in  the  raw  data  meant  that  the  first  attempt  to  convert  it 
into  system  files  failed.  These  were  of  two  kinds;  a 
displaced  field  in  the  header  record  at  district  level  in  nine 
counties,  and  impossible  grid  references  in  another  ten. 
When  this  happened,  the  raw  data  would  have  to  be 
emended.  When  the  file  concerned  was  an  ED  level  file, 
it  was  too  large  for  our  screen  editor  to  be  used,  and  a 
FORTRAN program had to be written to change it. The
Census  Dissemination  Unit  had  a  valuable  role  to  play  in 
performing  quality  assurance  on  the  data.  The  Census 
Offices  are  unable  to  do  this  themselves  since  they  can 
only  extract  six  areas  at  a  time!  There  were  also  blank 
tapes,  defective  tapes,  tapes  with  the  wrong  data  on  and 
files  which  were  incomplete. 
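The ED-level correction described above amounts to a streaming edit: each record is read, repaired and written out in turn, so the file never has to fit in an editor. The original program was written in FORTRAN and is not reproduced in this article; the Python sketch below only illustrates the idea. The record layout (comma-separated fields with easting and northing in positions 1 and 2), the coordinate bounds and the choice to blank a bad reference are all assumptions made for illustration, not the actual 1991 raw data format or the repair that was applied.

```python
import io

# Hypothetical bounds for an easting/northing pair; the limits actually
# used to detect "impossible" grid references are not given in the article.
MAX_EASTING = 700_000
MAX_NORTHING = 1_300_000


def grid_ref_is_possible(easting: int, northing: int) -> bool:
    """Return True if the coordinates fall inside the assumed bounds."""
    return 0 <= easting <= MAX_EASTING and 0 <= northing <= MAX_NORTHING


def stream_fix(src, dst, fix_record) -> int:
    """Copy src to dst one record at a time, applying fix_record to each.

    Only one record is held in memory at once, so file size is not a
    constraint. Returns the number of records that were changed.
    """
    changed = 0
    for line in src:
        record = line.rstrip("\n")
        fixed = fix_record(record)
        if fixed != record:
            changed += 1
        dst.write(fixed + "\n")
    return changed


def blank_impossible_grid_ref(record: str) -> str:
    """Assumed layout: area_code,easting,northing,rest-of-record."""
    fields = record.split(",")
    try:
        easting, northing = int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        return record  # not a record this fix understands; leave untouched
    if not grid_ref_is_possible(easting, northing):
        fields[1] = fields[2] = ""  # blank the bad reference for later review
    return ",".join(fields)


if __name__ == "__main__":
    raw = io.StringIO("01AA01,384000,398000,12\n01AA02,9999999,398000,7\n")
    out = io.StringIO()
    count = stream_fix(raw, out, blank_impossible_grid_ref)
    print(count, "record(s) corrected")
    print(out.getvalue(), end="")
```

Passing the repair as a function keeps the streaming loop unchanged when a different fault, such as the displaced header field, needs its own correction.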

In  addition,  the  sheer  size  of  some  files  caused  problems. 
The  largest  single  file  was  the  100%  ED  level  file  for 
Outer  London,  which  was  320  Mb.  This  was  split  across 
three  tapes.  Manchester  Computing  Centre  is  used  to 
coping  with  multi-volume  files,  but  a  further  problem 
arose  when  the  file  was  too  large  to  fit  onto  one  of  our 
backup  cartridges.  No  one  had  ever  needed  to  copy  a  file 
that  large  on  to  cartridge  before,  but  eventually  we 
worked  out  a  way  of  splitting  it  between  two  cartridges. 
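The article does not say how the file was divided between the two cartridges. As a rough illustration of the general technique, the Python sketch below cuts a file into consecutive parts no larger than a given capacity and joins them again on restore. The function names, the `.partN` naming scheme and the capacity used in the demonstration are all hypothetical.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # copy buffer size in bytes


def split_file(path: str, capacity: int) -> list[str]:
    """Split path into numbered parts, each at most capacity bytes."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    parts = []
    with open(path, "rb") as src:
        while True:
            part_path = f"{path}.part{len(parts) + 1}"
            remaining = capacity
            written = 0
            with open(part_path, "wb") as dst:
                while remaining > 0:
                    block = src.read(min(CHUNK, remaining))
                    if not block:
                        break  # source exhausted
                    dst.write(block)
                    written += len(block)
                    remaining -= len(block)
            if written == 0:
                os.remove(part_path)  # nothing left to copy; drop empty part
                break
            parts.append(part_path)
    return parts


def join_parts(parts: list[str], out_path: str) -> None:
    """Concatenate the parts, in order, back into a single file."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                while block := src.read(CHUNK):
                    dst.write(block)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "edfile.raw")
    with open(original, "wb") as f:
        f.write(b"x" * 2500)
    pieces = split_file(original, capacity=2000)  # stand-in for one cartridge
    print([os.path.getsize(p) for p in pieces])
    join_parts(pieces, original + ".restored")
```

Keeping the parts as plain byte ranges means the restore step needs no knowledge of the record structure, only the order of the parts.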

SASPAC  is  at  heart  a  PC  package,  and  the  mainframe 
version  has  had  to  be  created  from  the  PC  version.  This 
has  been  done  at  Manchester  for  our  CMS  system,  and  a 
similar  process  is  going  on  here  in  Edinburgh. 

At  present  Manchester  Computing  Centre  is  making  the 
data  available  in  two  main  forms;  we  are  holding  it  on  our 
mainframe  both  as  raw  data  and  as  system  files,  and  we 
are  also  supplying  it  as  PC  SASPAC  system  files. 

The total raw data for Great Britain at all areal levels
comes to some 6-7 gigabytes. We have been allocated a
certain amount of minidisk space on the Amdahl at
Manchester to hold it. This consists of 5 minidisks each
with a capacity of 1.2 gigabytes. We have filled one of
these, and two-thirds of a second, with system files, but
there is not sufficient space on the others to keep all the
raw data. As a compromise, we keep all the raw data
files in prime space, except for the lowest level files (for
Enumeration Districts/Output Areas) which are the
largest ones. The complete, corrected raw data has been
backed up on to cartridges which can be loaded by robot,
and any files which are not online can be quickly recalled.
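The space constraint can be checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the text; it assumes that "one and two-thirds" refers to whole-minidisk capacity and that all the sizes are measured in the same unit.

```python
from fractions import Fraction

# Figures quoted in the text.
MINIDISKS = 5
CAPACITY_GB = Fraction(6, 5)                 # 1.2 gigabytes per minidisk
DISKS_USED_BY_SYSTEM_FILES = Fraction(5, 3)  # one disk plus two-thirds of a second
RAW_DATA_GB = (6, 7)                         # "some 6-7 gigabytes"

total_gb = MINIDISKS * CAPACITY_GB
system_gb = DISKS_USED_BY_SYSTEM_FILES * CAPACITY_GB
free_gb = total_gb - system_gb
shortfall_gb = tuple(raw - free_gb for raw in RAW_DATA_GB)

print(f"Total allocation:     {float(total_gb):.1f} GB")
print(f"Used by system files: {float(system_gb):.1f} GB")
print(f"Left for raw data:    {float(free_gb):.1f} GB")
print(f"Shortfall:            {float(shortfall_gb[0]):.1f}-{float(shortfall_gb[1]):.1f} GB")
```

On these figures the system files occupy 2 gigabytes of the 6 gigabytes allocated, leaving 4 gigabytes for raw data that needs 6-7, which is why the largest (ED-level) files are kept offline.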

Our  tape  librarian  has  been  creating  PC  SASPAC  system 
files  from  the  corrected  raw  data.  These  are  not  held  on 
line;  the  only  need  to  do  this  is  if  someone  at  another  site 
wants  to  transfer  the  data  in  this  form.  If  we  receive  such 
a  request  we  can  put  the  relevant  file  on  line  until  it  has 
been  transferred. 

We  have  a  number  of  ways  of  sending  the  data  to  other 
sites.  We  encourage  users  to  help  themselves  to  the  data 
they  need  where  this  is  practical,  firstly  giving  them 
access to the appropriate minidisks. Newcastle University
have been successfully using TCP/IP coloured book
to  transfer  all  the  raw  data;  FTP  will  also  work,  but  is 
slower.  We  can  send  data  to  other  sites  ourselves,  though 
the  success  of  this  depends  on  the  bandwidth  of  the 
JANET  link  for  the  receiving  machine.  An  attempt  to 
FTP  a  large  raw  data  file  to  Portsmouth  University 
managed  to  bring  their  mainframe  to  a  standstill.   When 
it  proves  impossible  to  transfer  data  over  JANET,  we  are 
prepared  to  send  the  data  on  floppy  disks  or  on  magnetic 
tape;  for  this  service  there  is  a  media  charge  to  cover  the 
cost  of  the  materials  used  and  the  time  to  copy  the  files. 
There  is  no  charge  for  the  data  as  such,  because  it  has 
already  been  paid  for  under  the  Census  Initiative. 

Manchester  Computing  Centre  has  also  developed  a 
sideline  in  supplying  data  to  Local  Authorities  in  the 
form of PC SASPAC system files on floppy disk or tape
streamer.  This  is  because  the  Census  Offices  have  been 
unable  to  cope  with  the  demand  for  these  datasets. 

Most  users  are  accessing  the  data  on  the  Manchester 
mainframe.  This  requires  that  they  get  a  Manchester 
userid,  which  can  be  obtained  via  a  Manchester  rep  at  the 
computing  centre  at  their  institution.  Where  there  is  no 
rep, we can deal with them directly. The advantages of
this  method  are  that  all  the  data  is  quickly  accessible  on  a 
powerful  machine.  Any  problems  with  the  data  can  be 
reported  to  me  and  sorted  out  directly;  there  is  no 
problem  with,  for  example,  supplying  a  corrected  version 
of a dataset to a remote machine. There is also a full
range  of  support  and  a  proper  user  note.  Incidentally,  it 
enables  us  to  monitor  use  of  the  data  easily,  both  by 
counting  the  number  of  users  currently  registered  and  by 
using  the  project  accounting  package  VMACCOUNT  to 
find  out  the  amount  of  CPU  time  which  has  been  used  up 
and  by  whom. 

The drawbacks, from the user's point of view, are that
they have to learn how to use the CMS system. Even a
computer-literate user brought up on a different type of
mainframe may find the nested screens in CMS and the
XEDIT editor offputting at first. Moreover, some sites
have difficulties in getting full-screen CMS on many of
their  terminals.  One  way  of  getting  round  this  is  to  use 
JTMP  to  run  jobs  remotely.  A  job  can  be  submitted  from 
the  user's  own  mainframe,  sent  to  Manchester,  and  the 
output  will  be  sent  back  along  the  same  route.  Once  a 
workable  JTMP  framework  has  been  set  up,  there  is  no 
need  for  the  user  to  log  on  directly  to  CMS  at  all.  Oxford 
and  Bangor  are  using  the  JTMP  method  profitably. 

A  number  of  sites  are  holding  the  data  on  their  own 
mainframe.  A  few  are  taking  all  of  it;  these  include 
Newcastle (for the use of a specific project) and Edinburgh.
Because of the size of the data this is a large
commitment of space which is only worthwhile where a
large number of people want to look at a lot of the data.
Other sites are just taking a subset of the data, perhaps
their own county and a few others nearby. If they are
holding it on their mainframe, they have to devise their
own implementation of SASPAC.

PC  SASPAC  is  being  made  available  under  the  CHEST 
deal  to  academic  sites  who  want  to  hold  it.  The  cost  of  a 
site licence is just under £3,000 for five years, and about
15 sites have taken up this offer. However, some sites
which have received the software under this deal
have  not  requested  any  data,  which  suggests  that  they  do 
not  have  the  resources  to  run  a  properly  supported  1991 
Census  data  service.  There  are  certain  advantages  to  the 
PC  version;  it  is  menu-driven  and  people  do  not  have  to 
learn  any  computer  system  to  be  able  to  use  it.  Instead, 
they  are  guided  through  menus  by  selecting  items.  The 
drawbacks  are  that  it  is  significantly  slower  than  the 
mainframe  version,  and  the  user  is  limited  to  the  amount 
of  data  which  can  be  fitted  on  to  a  PC.  Even  allowing  for 
the  impressive  compression  which  SASPAC  performs  on 
raw  data,  it  still  takes  up  a  large  amount  of  space.  For 
example,  the  data  for  Cambridgeshire  is  reduced  from 
over  100  MB  of  raw  data  files  to  18  MB  of  PC  system 
files.    PC  SASPAC  is  therefore  not  useful  to  someone 
who  will  want  to  study  low-level  data  for  a  large  number 
of  counties,  though  it  is  suitable  for  someone  who  is  only 
interested  in  one  or  two  and  not  fond  of  computers!  For 
this  reason  it  has  proved  popular  with  Local  Authorities. 

There  is  another  package  for  accessing  the  Census  data, 
C91.  This  was  developed  by  Powys  County  Council  and 
runs  on  PC's  only.  We  do  not  support  C91  but  we  are 
prepared  to  supply  raw  data  to  people  who  have  it.  It  is 
also  possible  to  read  the  raw  data  directly  using  SPSS; 
this  is  not  for  the  faint  hearted! 

Where sites are distributing the data locally, in whatever
form, they must set up their own local registration
system. We have produced guidelines for local registration
forms and conditions of use, which should be based
on ours, and we approve the proposed forms and conditions
before the system is set up.

The  Census  Offices  in  Great  Britain  are  very  concerned 
that the datasets should not be misused, and the registration
procedure has been developed with this in mind.
The  conditions  of  use  to  which  all  users  are  required  to 
agree  insist  among  other  things  that  users  should  not 
attempt  to  identify  particular  individuals  or  households, 
that  the  data  should  only  be  used  for  academic  purposes, 
and that any publications which make extensive use of
the data should acknowledge this. We act as 'gatekeepers'
by advising on licensing arrangements for the use of
the  data.  When  a  user  is  supplying  data  to  commercial 
organisations,  they  must  contact  the  Census  Offices  for 
permission  and  the  status  of  their  account  at  MCC  (if 
they  have  one)  will  be  affected.  If  they  are  using  data  on 
behalf  of  an  organisation  in  the  public  sector,  such  as  a 
local authority, the Sports Council or the Equal Opportunities
Commission, there may be no need to pay for the
data  or  to  contact  the  Census  Offices,  but  again  they  may 
have  to  pay  for  their  use  of  the  Manchester  mainframe. 

A  typical  user  will  contact  me  for  a  registration  pack,  or 
collect  one  from  his  or  her  institution.  The  registration 
pack  contains  a  registration  form,  an  order  form  for  the 
SASPAC manual, the Conditions of Use and a description
of the datasets by Professor Philip Rees, the head of
the Census Initiative. In addition to the standard questions
about name, address, etc. the registration form asks
for  a  keyword  description  of  the  user's  interests,  the 
length  of  time  for  which  the  data  is  required  and  the 
source  of  funding.  The  completed  form  is  then  returned 
to  me,  and  I  add  the  user  to  the  list  of  people  who  can 
access  the  minidisks  on  which  the  data  is  held.  I  return 
to  them  a  copy  of  their  form,  together  with  a  guide  to 
accessing  the  data  at  MCC.  Currently  we  have  registered 
about  400  people  by  this  method.  Most  of  our  users  are 
in  geography,  social  science  and  medical  departments. 

The  Census  Offices  require  that  anyone  who  is  using  the 
data  should  assent  to  the  Conditions  of  Use,  and  to  cut 
down  on  paperwork,  we  have  devised  a  class  registration 
form,  to  be  used  when  a  class  is  accessing  the  data.  The 
lecturer  completes  the  first  page,  and  then  the  students 
should  all  sign  on  the  next  one.  Similarly,  if  one  person 
is  obtaining  data  on  behalf  of  another,  they  should  both 
be  registered.  This  often  happens,  for  example,  when  a 
research assistant pulls information off for a less
computer-literate academic.

The postcode to ED directory has been put on line as a
set of raw data files, one for each county and one for the
whole of England/Wales, together with an online guide to
them.

We  also  offer  various  kinds  of  support  to  users.  I  have 
devised  a  half-day  course  on  the  Census  which  I  have 
given  at  Manchester  and  at  about  twenty-five  other 
institutions.  The  audiences  for  these  have  been  made  up 
of people who are already registered as well as potential
users. The course consists of an hour talking about the
Census, the Initiative and the various datasets available,
followed  by  another  hour  of  demonstration  of  SASPAC 
programs  and  sometimes  hands-on  exercises  at  the  end. 
The  handout  accompanying  the  course,  which  contains 
about fifteen sample programs demonstrating the functionality
of SASPAC, has proved a useful supplement to
the  manual,  where  the  examples  are  less  geared  to  the 
needs  of  the  typical  academic  user.  I  have  also  given 
presentations  at  workshops  at  Cardiff  run  under  the 
Census  Initiative  and  spoken  at  and  attended  other 
relevant  meetings. 

We  run  an  email  list  on  the  Census  datasets,  which  is  to 
be  transferred  to  the  MAILBASE  system  at  Newcastle; 
any  message  sent  to  the  list  will  be  sent  as  email  to  all 
subscribers.  It  is  intended  that  any  subscriber  can  post  a 
message  to  the  list;  however,  most  of  the  messages  have 
been  from  me,  telling  the  list  about  the  latest  data  to 
arrive  and  about  future  runnings  of  my  course.  This  has 
proved  an  efficient  way  of  informing  users  about  latest 
developments,  and  was  particularly  important  when  the 
data  was  still  arriving.  There  is  a  bulletin  box  which  can 
be  displayed  on  CMS  using  the  command  CENSUS91; 
this  is  also  displayed  whenever  a  SASPAC  job  is  run. 
We  have  devised  a  'help'  system  which  will  look  up  the 
identifying  codes  for  districts  and  wards;  like  other  help 
systems  at  Manchester,  this  works  by  presenting  the  user 
with  a  list  of  options,  one  of  which  can  be  selected  with 
the  cursor.  We  intend  to  develop  another  help  system 
which  will  give  sample  programs,  based  on  the  examples 
in  my  handout.  There  are  a  number  of  helpful  files 
online, which can be read and downloaded by registered
users.  There  is  also  a  section  on  the  MSS  bulletin  board, 
which  is  occasionally  updated  as  further  data  arrives;  to 
date,  it  has  been  consulted  over  300  times. 

Manchester Computing Centre has reprinted the SASPAC
manual produced by the London Research Centre,
which comes in two volumes, one describing the language
and one giving outlines of the tables. Roz Scott
Huxley  at  MCC  has  also  written  a  user  note  for  SASPAC 
on  the  Manchester  mainframe;  in  addition  to  explaining 
commands  relating  to  SASPAC,  it  also  assumes  the  user 
is  not  familiar  with  CMS  and  the  Xedit  editor,  and  takes 
them  step  by  step  through  creating  a  command  file, 
running  it  interactively,  offline  or  remotely  and  collecting 
the output. I also write regularly for the local and
national newsletters produced by MCC, and for the
ESRC Data Archive Bulletin.

We are also prepared to answer queries from people who
are  having  problems  in  using  SASPAC,  whether  by 
telephone  or  email.  There  is  not  a  very  large  volume  of 
these,  as  many  people  get  help  from  another  user  at  their 
institution.  Often  the  problem  can  be  solved  at  once; 
other  questions  have  shown  up  problems  with  the  data  or 
with  the  SASPAC  software.  In  the  former  case  these  can 
be  referred  back  to  the  Census  Offices  and  in  the  latter  to 
the  London  Research  Centre. 

Future  work  will  include  loading  the  Special  Migration 
Statistics,  the  Special  Workplace  Statistics,  the  digitised 
boundary data and the appropriate registration procedures
for them as well as developing the online help
system.  We  will  also  have  to  transfer  the  data  to  the 
new  machine  at  Manchester  when  it  is  installed  towards 
the  end  of  1993  and  implement  SASPAC  on  the  UNIX 
system  which  will  be  mounted  on  it. 

1. Presented at IASSIST/IFDO 93 Conference held in
Edinburgh,  Scotland.  May  1993. 




Problems  of  Harmonising  UK-Wide  Samples  of  Anonymised 
Records  from  the  1991  Census 


by Elizabeth Middleton ¹

Census  Microdata  Unit 

Faculty  of  Economics  and  Social  Studies 

University of Manchester, Manchester M13 9PL

Introduction 

For  the  first  time  in  a  British  Census  the  output  from  the  1991  Census  includes  the  Samples  of  Anonymised  Records, 
known  as  the  SARs.  The  SARs  are  individual  level  data  and  can  be  analysed  in  the  same  way  as  survey  data  using 
familiar  statistical  software  packages. 

The  datasets  have  been  purchased  jointly  by  the  Economic  and  Social  Research  Council  and  the  Information  Systems 
Committee  of  the  Universities  Funding  Council.  The  latter  has  now  been  superseded  by  the  Joint  Information  Systems 
Committee. The SARs will shortly be housed at the Census Microdata Unit at Manchester University, which is responsible
for dissemination of the data to the academic and non-academic sectors.

The  Availability  of  Census  Microdata  in  Other  Countries 

In  recent  years  there  has  been  a  growing  demand  in  most  industrialized  countries  for  individual  level  data  and  the  release 
of  samples  of  census  microdata  in  Great  Britain  follows  the  examples  set  by  the  United  States,  Canada  and  Australia. 

The  United  States  first  released  their  Public  Use  Microdata  Samples  in  the  1960s  and  these  hierarchical  files  of  housing 
units  and  persons  have  been  heavily  used  ever  since.  From  the  1990  census  the  samples  available  consist  of  a  5% 
county/county  equivalent  file,  a  1%  metropolitan  area  file  and  a  3%  sample  of  the  elderly.  Canada  has  released  census 
microdata  samples  since  the  1970s  and  this  time  are  producing  an  individual  file,  a  household  and  housing  file  and  a 
family file. The 1980s saw the release for the first time of Australian census microdata samples. These did not contain
much  geographical  detail  but  negotiations  currently  in  progress  are  likely  to  lead  to  the  design  of  two  samples  from  the 
1991  census,  one  with  more  geographical  information  and  less  classification  detail  than  the  other. 

In Great Britain, two files will soon be available, a 2% individual sample and a 1% sample of households. The hierarchical
file of households has geographical identifiers only for the standard regions of Great Britain while the geographical
identifiers  on  the  individual  file  are  for  areas  with  a  minimum  population  of  120,000  persons  (Marsh  1992).  Northern 
Ireland  SARs,  similar  to  those  of  Great  Britain,  should  be  available  early  in  1994. 

Of course the release of microdata does involve some risks of disclosure of personal information and so in all the above
examples  the  data  has  been  'anonymised'  to  reduce  the  risk  of  identifying  any  individual.  Some  variables  may  be 
suppressed,  categories  of  other  variables  may  be  grouped  together  and  perturbation  techniques  may  be  used. 
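The three techniques named above can be illustrated on a single record. The Python sketch below is a toy example only: the variable names, the five-year age bands, the top-coding threshold and the size of the random shift are all assumptions, and it does not describe the procedures actually applied to the SARs or to any other country's samples.

```python
import random


def suppress(record: dict, variables: set[str]) -> dict:
    """Suppression: drop the named variables from the record entirely."""
    return {k: v for k, v in record.items() if k not in variables}


def group_age(age: int, band: int = 5, top: int = 90) -> str:
    """Grouping: replace an exact age with a band, top-coded at `top`."""
    if age >= top:
        return f"{top}+"
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


def perturb(value: int, rng: random.Random, max_shift: int = 1) -> int:
    """Perturbation: shift a count by a small random amount, never below 0."""
    return max(0, value + rng.randint(-max_shift, max_shift))


def anonymise(record: dict, rng: random.Random) -> dict:
    """Apply all three steps to one (hypothetical) individual record."""
    out = suppress(record, {"name", "address"})
    out["age"] = group_age(out["age"])
    out["rooms"] = perturb(out["rooms"], rng)
    return out


if __name__ == "__main__":
    person = {
        "name": "A. N. Other",
        "address": "1 Example Street",
        "age": 37,
        "rooms": 4,
        "region": "North West",
    }
    print(anonymise(person, random.Random(1991)))
```

Each step trades analytical detail for a lower chance of matching a record to a known person, which is why the choice of bands and thresholds is a matter of negotiation rather than a fixed rule.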

In  Britain  individual  level  data  is  also  available  in  the  OPCS  Longitudinal  study  where  microdata  for  the  1971,  1981  and 
1991  censuses  is  linked  together  with  data  drawn  from  the  National  Health  Service  Central  Register.  However,  access  to 
this  microdata  is  tightly  controlled  and  OPCS  release  tables  from  the  Longitudinal  Study  only  after  checking  the  tables 
thoroughly  for  any  breaches  of  confidentiality. 

In general, other European countries have placed more restrictions on the availability of census microdata but in many
countries  the  policy  of  releasing  microdata  is  under  review.  Spain  and  Luxembourg,  for  example,  have  allowed  use  of 
microdata  only  by  government  and  not  by  academics.  Microdata  from  the  French  census  is  available  to  academics  under 
special  arrangements  which  may  include  working  on  the  premises  of  INSEE,  the  National  Statistical  Institute.  In  Italy, 
academics can access microdata by special request. In fact, microdata from the 1981 Italian census was used by
researchers at the University of Newcastle upon Tyne, in collaboration with the Regional Government of Tuscany and the
Italian  Census  Agency,  for  work  on  disclosure  risks.  This  research  proved  useful  when  British  academics  made  the  case 
for  the  release  of  the  SARs  in  Britain  (Marsh  et  al  1991). 




In  Sweden  a  Freedom  of  Press  Act  allows  all  governmental  information  to  be  open  to  the  public,  subject  to  restrictions 
on access to personal information, and academics can request permission to access census microdata. Microdata in
Norway  is  not  publicly  available  but  academics  can  obtain  access  to  anonymised  census  microdata. 

Neither Denmark nor the Netherlands holds a conventional census (Langevin 1992). Instead, similar information is
obtained  from  administrative  files  and  registers.  In  Germany,  recommendations  have  recently  been  made  for  release  of 
data  from  the  German  Microcensus,  a  survey  conducted  annually  with  1%  of  the  population.  These  recommendations 
involve  both  suitable  anonymisation  of  records  and  contractual  commitments  of  recipients  of  the  data  (Knoche  1992). 

Censuses  in  the  United  Kingdom 

In the British Census, census forms for England, Scotland and Wales differ in some small respects: in Wales there is a
Welsh  language  question,  in  Scotland  there  is  a  Gaelic  language  question  and  also  a  question  about  the  lowest  floor  level 
of  accommodation.  The  census  forms  are  all  processed  in  the  same  way  by  the  Office  of  Population  Censuses  and 
Surveys  and  the  General  Register  Office  for  Scotland. 

However, in Northern Ireland there is a different constitutional framework and the Northern Ireland census is the
responsibility of the Northern Ireland Civil Service. The Northern Ireland Census Office is part of the Department of Health
and  Social  Security.  Access  to  the  data  has  to  be  negotiated  separately  with  the  Northern  Ireland  authorities  and  partly 
for  this  reason,  Northern  Ireland  data  has  often  been  neglected  in  the  past. 

This year the ESRC is funding more work on the Northern Ireland Census. At Queen's University, Belfast,
work is underway to produce Local Base Statistics and Small Area Statistics and the ESRC is also negotiating with the
Northern Ireland Census Office for the purchase of Northern Ireland SARs.

Northern  Ireland  SARs 

At  the  time  of  writing,  the  statistical  specification  for  the  Northern  Ireland  SARs  is  almost  finalised  and  is  as  similar  as 
possible to that of the British SARs. As for Great Britain, two files are to be produced.

(i) The first is a 2% sample of individuals at private addresses and residents of communal establishments. The
geographical scheme identifies ten areas, chosen so that the population in each area is greater than 120,000. The SAR areas
are amalgamations of District Councils with similar characteristics.

(ii) The second file is a 1% hierarchical sample of households and the individuals in those households. Here the
geographical identifier is just that of Northern Ireland itself.

Sampling  methods  are  similar  to  those  used  for  the  British  SARs  but,  while  the  British  samples  are  drawn  from  the  10% 
of  census  records  which  are  fully  coded,  in  Northern  Ireland  the  samples  are  drawn  from  the  100%  fully  coded  data. 
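The difference in sampling base changes the selection rate that has to be applied to the coded records. The arithmetic can be sketched as follows, assuming the quoted percentages are all fractions of the full set of census records; the function name is illustrative and is not taken from either census office.

```python
from fractions import Fraction

def selection_rate(sample_pct, coded_pct):
    """Fraction of the fully coded records that must be selected to
    yield a sample of sample_pct percent of all census records."""
    return Fraction(sample_pct, 100) / Fraction(coded_pct, 100)

# Great Britain: 2% individual sample drawn from the 10% coded records.
gb_individual = selection_rate(2, 10)
# Northern Ireland: 2% individual sample drawn from 100% coded data.
ni_individual = selection_rate(2, 100)
```

Under these assumptions the British individual file takes one in five of the coded records, while the Northern Ireland file takes one in fifty of all records.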

Confidentiality  issues  are  particularly  important  in  Northern  Ireland  because  of  the  risks  of  disclosure  of  information 
about  the  security  forces.  However,  since  there  is  no  fine  detail  geographical  information  in  the  SARs,  no  additional 
measures  to  reduce  the  risk  of  disclosure  were  considered  necessary.  The  same  broad-banding  of  categories  is  used  in 
both  the  Great  Britain  and  Northern  Ireland  SARs.  For  the  individual  sample,  as  in  the  Great  Britain  SARs,  officers  in 
the armed forces are grouped with Police Officers, Prison Service Officers, Fire Service Officers and Customs and Excise
Officers.  Other  ranks  are  grouped  together  in  a  separate  category. 

Harmonisation  of  United  Kingdom  SARs 

Census  reports  for  the  whole  of  the  United  Kingdom  have  not  been  available  in  the  past  and  so  a  harmonised  United 
Kingdom  SAR  dataset  would  be  extremely  useful.  SAR  data  files  containing  data  from  both  Great  Britain  and  Northern 
Ireland  would  allow  such  reports  to  be  produced  easily. 

However  there  are  some  problems  in  producing  such  a  dataset  and  they  fall  into  the  following  categories: 

Differences  in  Questions  on  Census  Form 

The  starting  point  for  comparison  of  variables  in  the  Northern  Ireland  and  British  SARs  is  of  course  the  census  form. 

Questions on both forms are, in general, worded in the same way but there are a few exceptions.




The wording of the Qualifications question is quite different. In Great Britain only qualifications normally obtained after
the  age  of  18  are  asked  for  but  in  Northern  Ireland  qualifications  obtained  at  school  level  are  also  recorded.  The  subject 
of  the  highest  level  qualification  is  recorded  on  the  British  form  but  not  on  the  Northern  Ireland  form.  Other  small 
differences  in  the  wording  of  questions  do  not  present  serious  harmonisation  problems. 

In  Great  Britain  the  ethnicity  question  was  introduced  for  the  first  time  for  the  1991  Census  and  is  not  included  in  the 
Northern Ireland Census. The Northern Ireland form has some additional questions: a fertility question (to be answered by all
married,  widowed,  separated  and  divorced  women),  an  Irish  language  question,  a  question  on  religion  and  additional 
amenities  questions  on  water  supply  and  domestic  sewage  disposal. 

The question on religion was voluntary and was introduced partly to assist monitoring of Northern Ireland's fair
employment legislation under which employers of 25 or more people have to register to say how their workforce is balanced
between  Catholics  and  Protestants. 

Differences  in  Coding 

Even  when  the  wording  of  questions  on  the  census  form  is  the  same,  the  way  in  which  information  is  coded  can  vary. 

Information can be lost when being transferred from form to computer at the coding stage.

An  example  of  this  arises  in  the  coding  of  the  relationship  to  head  of  household  question.  While  Great  Britain  codes  this 
using  16  categories,  Northern  Ireland  codes  the  answers  to  the  relationship  question  using  just  10  categories.  As  shown 
in  Table  1,  such  relationships  as  'Cohabitant  of  son  or  daughter'  or  'Boarder/lodger'  would  be  coded  separately  in  Great 
Britain  but  would  be  classed  as  'Other  unrelated'  in  Northern  Ireland. 


Table 1. Differences in Coding

Relationship to Head of Household

G.B.                                N.I.

 0  Head of Household                0  Head of Household
 1  Spouse                           1  Spouse
 2  Cohabitant                       2  Cohabitant
 3  Son/daughter                     3  Son/daughter
 4  Child of cohabitant
 5  Son/daughter-in-law              8  Son/daughter-in-law
 6  Cohabitant of son/daughter
 7  Parent                           4  Parent
 8  Parent-in-law                    7  Parent-in-law
 9  Brother/sister                   5  Brother/sister
10  Brother/sister-in-law
11  Grandchild                       6  Grandchild
12  Nephew/niece
13  Other related                    9  Other related
14  Boarder, lodger etc.
15  Joint head
16  Other unrelated                 10  Other unrelated
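For harmonisation work, the correspondence in Table 1 can be held as a lookup from Great Britain codes to Northern Ireland codes. The sketch below is a minimal illustration: it includes only the pairs that appear side by side in the table and the two collapses named in the text, and it returns None for Great Britain categories whose Northern Ireland destination is not stated in this article. The names are hypothetical.

```python
# G.B. codes with a same-named N.I. category in Table 1.
GB_TO_NI_DIRECT = {
    0: 0,    # Head of household
    1: 1,    # Spouse
    2: 2,    # Cohabitant
    3: 3,    # Son/daughter
    5: 8,    # Son/daughter-in-law
    7: 4,    # Parent
    8: 7,    # Parent-in-law
    9: 5,    # Brother/sister
    11: 6,   # Grandchild
    13: 9,   # Other related
    16: 10,  # Other unrelated
}

# Collapses stated explicitly in the text.
GB_TO_NI_STATED = {
    6: 10,   # Cohabitant of son/daughter -> Other unrelated
    14: 10,  # Boarder, lodger etc. -> Other unrelated
}

def recode_relationship(gb_code):
    """Return the N.I. code for a G.B. code, or None when the
    article does not say where the category would be placed."""
    if gb_code in GB_TO_NI_DIRECT:
        return GB_TO_NI_DIRECT[gb_code]
    return GB_TO_NI_STATED.get(gb_code)
```

Leaving the unstated categories as None keeps the gaps visible rather than guessing at the clerical coding rules.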

Differences  in  Definitions 

The relationship to head of household is used to identify families within a household, and differences exist between the
census definitions of the family used in Northern Ireland and Great Britain.




In  both  censuses,  a  family  unit  has  a  maximum  of  two  generations  with  the  younger  generation  never  married  and  having 
no partner or children. There is no age limit for a child. However, in Northern Ireland a cohabiting couple is not
regarded as a family unit as it is in Great Britain. A cohabiting couple would therefore be classified in the Northern
Ireland Census as 'family type' not applicable or, if the couple had children, as a 'lone parent with children, with others'
family  type.  The  Great  Britain  definition  does  in  fact  conform  to  United  Nations  recommendations  for  1990  censuses  in 
the  European  Community  (UN  1987). 

Again, the definition of a dependent child differs in the two censuses. In the British Census a person aged 16-18, never
married,  in  full  time  education  and  economically  inactive  is  regarded  as  a  dependent  child.  In  the  Northern  Ireland 
Census,  the  equivalent  age  banding  is  16-19.  However,  in  the  interests  of  harmonisation,  the  Northern  Ireland  Census 
Office  has  agreed  to  change  the  age  banding  to  16-18  for  the  Northern  Ireland  SARs  so  that  this  difference  in  definitions 
will  not  be  a  problem. 
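The effect of the age banding can be made concrete with a small predicate. This is a sketch only: it covers just the older age band described above (the treatment of children under 16 is not discussed in this passage), and the function and parameter names are illustrative.

```python
def meets_older_dependent_child_rule(age, never_married,
                                     in_full_time_education,
                                     economically_inactive,
                                     upper_age=18):
    """Apply the dependent-child test for the 16-plus age band.

    upper_age=18 follows the British Census and the harmonised
    Northern Ireland SARs; upper_age=19 follows the banding used
    in the Northern Ireland Census itself.
    """
    return (16 <= age <= upper_age
            and never_married
            and in_full_time_education
            and economically_inactive)
```

A 19-year-old, never-married, economically inactive student satisfies the test with `upper_age=19` but not with `upper_age=18`, which is exactly the group affected by the agreed change.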

Differences  in  Processing 

Clerical/Computerised  Allocation  of  Families 

In Great Britain allocation of individuals to families is done using a complex computer algorithm whereas in Northern
Ireland the number of families in a household is determined manually when coding from the census form.

The computer algorithm identifies sixty different family types and every individual in the household is allocated to a
family with its particular family number and family type. The sixty family type categories are reduced to ten categories
for the SARs, as shown in Table 2.


Table 2. Great Britain SARs Family Type

 1.  Married couple         - no children
 2.                         - dependent child(ren)
 3.                         - non-dependent child(ren)
 4.  Cohabiting couple      - no children
 5.                         - dependent child(ren)
 6.                         - non-dependent child(ren)
 7.  Lone parent            - dependent child(ren)
 8.                         - non-dependent child(ren)
 9.  Non-family household   - one person
10.                         - two or more persons

It  is  worth  noting  that  the  computer  algorithm  cannot  identify  families  amongst  a  group  of  household  members  unrelated 
to  the  head  of  household  and  in  such  cases  the  census  forms  must  be  processed  manually. 

In  Northern  Ireland,  the  number  of  families  in  a  household  is  determined  at  the  coding  stage  by  examination  of  the 
census  form.  The  household  is  classified  as  a  non-family  household  or  as  one  of  the  household  types  shown  in  Table  3. 
The  Northern  Ireland  census  database  does  not  hold  information  about  which  family  an  individual  belongs  to. 




Table 3. Northern Ireland SARs Household Family Type

     One-family household
 1.  - married couple, no children, no others
 2.  - married couple, no children, with others
 3.  - married couple, with children, no others
 4.  - married couple, with children, with others
 5.  - lone parent, with children, no others
 6.  - lone parent, with children, with others
 7.  - lone grandparent, with grandchildren, no others
 8.  - lone grandparent, with grandchildren, with others

     Two-family household
 9.  - related, no others
10.  - related, with others
11.  - not related, no others
12.  - not related, with others

13.  Three or more family household

This difference in processing means that the only way of harmonising family type is to use the Northern Ireland
classification of Household Family Type and to derive this classification for Great Britain households in the household SAR file.
In  the  Northern  Ireland  individual  sample,  variables  such  as  Social  Class  of  Family  Head  and  Economic  Position  of 
Family  Head  will  not  be  available.  Instead  similar  information  will  be  available  for  the  Head  of  Household. 

Calculations  of  Distance 

The Great Britain SARs hold information about the distance travelled by people to work and also on the distance of
move of migrants.

Calculations  of  distance  to  work  are  carried  out  in  different  ways  for  the  two  censuses.  The  Central  Postcode  Directory 
used  in  Great  Britain  is  not  currently  used  by  the  Northern  Ireland  Census  Office  but  distances  are  calculated  using  the 
Northern  Ireland  grid  square  reference  marked  on  the  respondent's  census  form  and  the  grid  square  reference  of  the 
employer.  The  employer's  grid  reference  is  only  known  for  large  employers  employing  over  25  persons  and  so  the 
Northern  Ireland  SARs  will  hold  distances  to  work  for  people  working  for  large  employers  only.  The  distance  of  move 
of  migrants  is  found  using  the  postcode  of  the  previous  address  of  the  migrant  and  so  cannot  be  found  without  using  the 
Central  Postcode  Directory.  This  means  that  there  are  currently  some  unresolved  problems  in  including  this  in  the 
Northern  Ireland  SARs. 
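A distance derived from two grid references can be sketched as below. The article does not say how the Northern Ireland Census Office computes the figure, so this assumes a straight-line distance between two points given as (easting, northing) pairs in metres on the same grid; the function name and the use of None for an unknown employer reference are illustrative.

```python
import math

def grid_distance_km(home, work):
    """Straight-line distance in kilometres between two grid
    references given as (easting, northing) in metres.

    Returns None when the employer's grid reference is unknown,
    as it would be for employers below the size threshold.
    """
    if work is None:
        return None
    dx = work[0] - home[0]
    dy = work[1] - home[1]
    return math.hypot(dx, dy) / 1000.0
```

Returning None rather than zero for unknown employer references keeps missing distances distinguishable from people who work at home.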

Clerical/Computerised  Editing  and  Imputation  Procedures 

In  Great  Britain  a  computerised  editing  system  is  used  to  identify  inconsistencies  and  missing  values  in  the  data  and  then 
to impute valid, consistent answers (Mills 1991). In Northern Ireland, only clerical processing is used. Some
inconsistencies are resolved manually at the coding stage and clerical procedures give rules for handling missing answers.




The  differences  between  these  two  approaches  can  be  seen  by  comparing  the  procedures  for  treating  missing  values  for 
the  number  of  cars  in  a  household.  In  Great  Britain  analysis  of  previous  census  results  shows  that  a  good  indication  of 
the  number  of  cars  in  a  household  is  given  by  the  number  of  people  in  the  household,  the  tenure  of  the  household,  and 
whether  the  accommodation  is  in  a  permanent  or  non-permanent  building. 

Missing  values  are  imputed  by  reference  to  these  other  factors.  However,  in  Northern  Ireland  the  missing  value  would 
be  coded  as  zero. 

This means that when making comparisons between Northern Ireland and Great Britain, it is important to understand the
effects  of  the  different  imputation  procedures. 
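The contrast between the two treatments of a missing number of cars can be illustrated with a toy example. This is not the OPCS edit system: the imputation below simply borrows the most common value among households with the same size, tenure and building type, and the records are invented purely for illustration.

```python
from collections import Counter

def code_missing_as_zero(households):
    """Clerical rule described for Northern Ireland: missing -> 0."""
    return [0 if h["cars"] is None else h["cars"] for h in households]

def impute_from_similar(households):
    """Fill a missing value with the most common value among records
    sharing household size, tenure and building type (a simplification)."""
    def key(h):
        return (h["persons"], h["tenure"], h["permanent"])

    observed = {}
    for h in households:
        if h["cars"] is not None:
            observed.setdefault(key(h), Counter())[h["cars"]] += 1

    result = []
    for h in households:
        if h["cars"] is not None:
            result.append(h["cars"])
        elif key(h) in observed:
            result.append(observed[key(h)].most_common(1)[0][0])
        else:
            result.append(None)  # no similar household to borrow from
    return result

# Hypothetical records, not census data.
households = [
    {"persons": 4, "tenure": "owner", "permanent": True, "cars": 2},
    {"persons": 4, "tenure": "owner", "permanent": True, "cars": 2},
    {"persons": 4, "tenure": "owner", "permanent": True, "cars": None},
    {"persons": 1, "tenure": "rented", "permanent": True, "cars": 0},
]
```

On these invented records the zero-coding rule yields a total of four cars and the imputation yields six, which shows how the choice of procedure alone can shift a car-ownership comparison.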

Conclusion 

Although  there  are  some  harmonisation  problems,  it  is  felt  that  the  production  of  a  harmonised  United  Kingdom  SAR 
dataset would be of great benefit to census data users. Since there is little good quality data available for the whole of
the  United  Kingdom,  the  UK  SARs  would  provide  a  new  research  resource  to  enable  work  to  be  carried  out  in  such 
areas  as  deprivation  and  discrimination. 

Even  within  the  United  Kingdom,  small  differences  in  census  definitions  and  processing  lead  to  difficulties  in  making 
direct  comparisons  of  some  social  characteristics  of  the  regions.  Hopefully,  the  increasing  adoption  of  international 
standards  will  reduce  the  amount  of  effort  required  to  be  spent  in  the  future  on  harmonisation  problems  and  result  in 
better  quality  comparative  statistics. 

References 

KNOCHE,  P  (1992).  Factual  Anonymity  of  Microdata  from  Household  and  Person  Related  Surveys.  The  Release  of 
Microdata  Files  for  Scientific  Purposes.  Proceedings  of  International  Seminar  on  Statistical  Confidentiality,  Dublin, 
organised  by  EUROSTAT  and  ISI. 

LANGEVIN,  B;  BEGEOT,  F;  PEARCE,  D  (1992).  Censuses  in  the  European  Community.  Population  Trends,  68. 

MARSH, C; SKINNER, C; ARBER, S; PENHALE, B; OPENSHAW, S; HOBCRAFT, J; LIEVESLEY, D & WALFORD, N
(1991). The case for samples of anonymised records from the 1991 census. Journal of the Royal Statistical
Society (A), 154, 2.

MARSH,  C;  TEAGUE,  A  (1992).  Samples  of  anonymised  records  from  the  1991  Census.  Population  Trends,  69. 

MILLS,  I;  TEAGUE,  A  (1991).  Editing  and  imputing  data  for  the  1991  Census.  Population  Trends,  64. 

UN  (1987).  Recommendations  for  the  1990  Censuses  of  Population  and  Housing  in  the  ECE  Region.  United  Nations 
Statistical  Commission  for  European  Statisticians,  Statistical  Standards  and  Studies,  40. 

1. Presented at IASSIST/IFDO 93 Conference held in Edinburgh, Scotland, May 1993.




Partitioned  Distributed  Computing 

by  George  Yates ' 

Social  Science  &  Public  Policy  Computing  Center 

University  of  Chicago 


Overview 

The Social Science and Public Policy Computing Center
(SSPPCC) at the University of Chicago was founded in
February, 1990, to provide general computing services to
the University's social science research and education
community. At the time, Sun Microsystems, Inc., had
recently started shipping the SPARC Station 1 UNIX
workstation with a RISC architecture which presaged a
new class of small computer cost-performance. DEC,
IBM, SGI, HP, and other UNIX workstation vendors
followed suit over the next 18 months, releasing RISC
workstation systems in a rapidly improving cost-performance
trend. Effective industry standards for hardware and
software facilities significantly unified the emerging
technology in all these releases. Today, small computer
systems exhibit scalar CPU performance and data storage
performance and capacity that rival or exceed mainframe
levels at a small fraction of the cost. Third-party software
and hardware developers provide extended facilities that
operate on most or all of the major vendors' systems, and
a network of such systems can cooperate to automatically
distribute the community-wide process load. In effect, a
community of users can now build a large-scale computing
environment from parts supplied by an entire industry
of competing vendors.

SSPPCC's technical strategy for large-scale computing in
the social sciences is based on a pure distributed computing
model designed as a multi-vendor network of cooperating
UNIX workstations that automatically share job
load submitted at any site in the local UNIX network
domain. The design is being implemented using a broad
selection of hardware and software technology and
investigative system administration. It is important to
briefly describe the current computing environment to
establish the scale of contemplated services.

SSPPCC's current large-scale computing environment
contains 16 high-performance UNIX workstations from
HP, Sun, and IBM, equipped with 608 MB of RAM from
HP, Sun, IBM, Technology Works, Kingston, and
Clearpoint. Peripherals include 40 GB of disk space
from HP, Seagate, IBM, and Maxtor, plus user-operated
nine-track and DAT tape drives from HP and cartridge
tape and CD-ROM drives from IBM and Sun. This UNIX
environment is accessed from 360 networked IBM/PC-clone
or Apple Macintosh desktop computers and Wyse
or Qume terminals. Today, active networking technologies
are Ethernet and AppleTalk driven by electronics
from Cabletron, Cayman Systems, and Farallon. FDDI
technology is scheduled for testing in July, 1992, using
concentrators from DEC, Cabletron, Ungermann-Bass,
and National Peripheral Devices.

The  processing  load  on  SSPPCC  UNIX  workstations 
originates  from  1,400  login  accounts,  with  50-100  active 
UNIX  logins  at  mid-afternoon.  Research  software  is 
primarily  data  management  and  statistical  analysis  tools 
including  SAS,  SPSS,  BMDP,  Stata,  MatLab,  LimDep, 
GLIM,  HLM,  and  Mathematica,  plus  custom  research 
software  in  Fortran  and  C.  Commonly  used  software  is 
implemented on all machine types. An offline tape
library  contains  over  2,500  titles,  including  Census,  CPS, 
PSID,  NLS,  NELS,  HS&B,  SIPP,  and  IMF. 

The  entire  assembly  is  growing  rapidly  in  all  dimensions, 
constrained  mostly  by  budget  and  management  issues. 
Importantly, growth is tightly organized by a comprehensive
technical design for distributed computing. Primary
distributed  computing  concepts  addressed  in  the  current 
SSPPCC  design  are: 

1.  Networking strategies,

2.  Location-brokering, 

3.  Distributed  File  Systems, 

4.  Hierarchical  Storage  Schemes, 

5.  File  Migration  Service. 

Implementation of these design concepts is in different
stages. SSPPCC networking strategies are a careful
design for partitioning network traffic so that the highest
loads are on the fastest links. As mentioned above,
high-speed FDDI is scheduled for testing soon; even faster
HPPI technology is awaiting product announcement for
UNIX workstations in late 1992. Location-brokering
automatically locates job execution on a workstation that
is most suited to the job load. HP's Task Broker software
is specified to provide location-brokering services. Task
Broker has been extensively tested, is in limited use on
HP and Sun workstations, and will be in common service
by Fall, 1992. Distributed file systems (Sun's NFS and
Transarc's AFS) have been extensively tested on all
platforms. NFS is extensively used; AFS is implemented
in a limited system role on all machines, to be in common
user service by Summer, 1992. Hierarchical storage
schemes that provide very large capacity storage are
well-researched but untested by SSPPCC. (An R-Squared
system that runs the Sybase relational DBMS server on
hierarchical storage is in use today at one SSPPCC client
site and represents a limited test.) Storage schemes being
considered are implementations of the IEEE Mass
Storage System Reference Model (MSSRM) from two
systems integrators: Advanced Computing Support
Center (ACSC) and Epoch. An implementation of the
MSSRM storage server from one of these vendors will be
the primary strategy for large data base handling in the
SSPPCC design. Both of these implementations include a
file migration service which automatically removes and
replaces files on local workstation disks as required by
local filesystem usage. Either ACSC's UniTree on IBM's
RS/6000 workstation or Epoch's Renaissance Migration
Service (RMS) on Sun's SPARC Station (to be ported to
the RS/6000 in 1992) will be implemented in 1992,
budget permitting.

This article describes the SSPPCC design for the distributed
computing functions listed above. For ease of
expression, the perspective in the following text is that all
services are operating. After details of the design are
developed, they will provide context for further discussion
of design implementation stages at Chicago. This
article concludes with consideration of design implementation
on a national scale.

Networking  Strategies  and  Location-Brokering 

In  distributed  computing  across  a  network,  a  job  is  not 
necessarily  executed  at  its  submittal  site.  Instead,  the 
networked  computers  cooperate  (unseen  by  the  user)  to 
locate  the  job's  execution  at  a  machine  deemed  most 
suitable for the job load, considering the job's characteristics
and the current load at all sites. Such cooperative job
placement is called location-brokering. Since social
science research jobs are frequently I/O-bound and a
relocated  job  might  execute  remotely  from  the  disks 
containing  required  data  files,  efficient  communication  to 
ship  data  to  a  job's  execution  site  is  essential  to  realizing 
the  gains  of  a  distributed  computing  strategy. 
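The placement decision described above can be sketched in a few lines. This is a generic illustration of location-brokering, not HP Task Broker's algorithm or interface: the host records, field names and the rule of choosing the least-loaded eligible machine are all assumptions made for the example.

```python
def choose_host(job, hosts):
    """Pick the eligible host with the lowest current load.

    A host is eligible if it has the software the job needs and
    enough free memory; ties on load are broken by host name.
    Returns None when no host can take the job.
    """
    eligible = [
        h for h in hosts
        if job["software"] in h["software"]
        and h["free_ram_mb"] >= job["ram_mb"]
    ]
    if not eligible:
        return None
    best = min(eligible, key=lambda h: (h["load"], h["name"]))
    return best["name"]

# Hypothetical hosts and job, for illustration only.
hosts = [
    {"name": "hp-a", "load": 3.0, "free_ram_mb": 64, "software": {"SAS", "SPSS"}},
    {"name": "sun-b", "load": 0.5, "free_ram_mb": 16, "software": {"SAS"}},
    {"name": "ibm-c", "load": 1.2, "free_ram_mb": 48, "software": {"SAS", "Stata"}},
]
job = {"software": "SAS", "ram_mb": 32}
```

Here the least-loaded machine is passed over because it lacks memory, so the job lands on `ibm-c`. A fuller broker would also weigh where the job's data files sit, which is the concern the rest of this section addresses.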

SSPPCC's  design  specifies  a  physical  network  structure 
that  allows  strategic  choices  for  permanent  file  location 
and  job  execution  location  so  that  any  required  file 
shipment  on  the  network  uses  a  communications  link 
whose  speed  is  appropriate  for  the  file  size.  Ideally,  larger 
files are shipped on faster links. Thus, network structure
and location-brokering are coupled concepts and are
discussed  together  in  this  section. 

The  specified  network  structure  is  a  layering  of  four 
standardized  communications  technologies  into  a 
hierarchy  of  links  with  graduated  speeds.  Specified 
technologies  and  nominal  link  speeds  are: 

1.  LocalTalk: 0.275 Mbits/sec. Used in
SSPPCC networks of Apple Macintosh computers for
very low-volume traffic in printing, low-volume
file-sharing, and messaging during sessions with E-mail and
login servers.

2.  Ethernet: 10 Mbits/sec. The most common
technology for networks of desktop computers and
UNIX workstations. Used for low-volume traffic including
print files, login and E-mail session messages, small
data files, and process control messages.

3.  Fiber Distributed Data Interface (FDDI):
100 Mbits/sec. The fastest technology commonly
available for local-area networking. Being tested by
SSPPCC for use as the initial primary medium for
file-sharing in the distributed computing net.

4.  High Performance Parallel Interface
(HPPI): 800 Mbits/sec. The most recent technology in
practical use for inter-computer communications.
Proposed for use in the SSPPCC design for very-large
file shipment. HPPI has a high-speed variant that doubles
the data path width to achieve nominal 1,600 Mbits/sec.
This variant is not currently announced for implementation
in workstation connections.
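The practical meaning of these nominal speeds is easiest to see as shipment times. The sketch below assumes 8 bits per byte, decimal megabytes and megabits, and no protocol overhead, so the results are lower bounds rather than measured figures; the dictionary and function names are illustrative.

```python
# Nominal link speeds from the list above, in Mbits/sec.
NOMINAL_MBITS_PER_SEC = {
    "LocalTalk": 0.275,
    "Ethernet": 10,
    "FDDI": 100,
    "HPPI": 800,
}

def nominal_transfer_seconds(file_mb, link):
    """Seconds to ship file_mb megabytes at the nominal link speed,
    ignoring protocol overhead and contention."""
    return file_mb * 8 / NOMINAL_MBITS_PER_SEC[link]
```

On these assumptions a 100 MB file needs at least 80 seconds on Ethernet, 8 seconds on FDDI and 1 second on HPPI, which is the gradient the layered design exploits.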

In the parlance of the OSI networking model, the Link
Layer protocol standard implemented for each of these
technologies is IEEE 802.2 (LLC) at the service interface.
Thus, though each Physical Layer specification is
radically different, software control of these four
technologies can be uniform above layer 2. In particular, the
layer 3 Internet Protocol (IP) is implemented on each of
these technologies.

Figure 1, Schematic Representation of Connection
Domains, on the next page is a schematic diagram of
SSPPCC's design for inter-computer connectivity. It
shows a network connection geometry partitioned by the
four networking technologies into a hierarchy of
link-speed domains. (The connection topology diagrammed in
each domain is a star, consistent with SSPPCC cable
plant design and physical layer requirements.) Describing
each of the labelled elements in Figure 1 as sites of
computerized functions and services is the goal of the
remaining discussion.




[Figure 1. Schematic Representation of Connection Domains. The diagram
partitions the network by communications technology into four link-speed
domains: HPPI (800 Mbits/sec), FDDI (100 Mbits/sec), Ethernet (10 Mbits/sec)
and LocalTalk (0.275 Mbits/sec). Labelled elements include the Data Storage
Server, File Servers, File Server Control Channel, two units marked "with
128 MB RAM", Project Servers, Primary Servers, Spill Servers, Login Server,
Cisco AGS+ Router, Cabletron Token Ring Hub, IBM Token Ring, IBM PC clones,
and Apple Mac II and Mac Plus computers.]




Specialized hardware and software is specified in each
domain as required to support specialized services
provided in the domain. The HPPI domain contains IBM
RS/6000 File Server and Storage Server systems specialized
for large data base handling. IBM and third-party
developments for RS/6000 POWER architecture and
intrinsic AIX 3.2 I/O services are essential to the
high-performance I/O required in the HPPI domain. The FDDI
domain contains HP and Sun Project Server systems
intended for general research computation. HP 700 series
workstations currently provide the fastest CPU performance,
the strongest upgrade path, and the best
cost-performance in the workstation industry; Sun SPARC
Stations are the most common and most inexpensive
software development platforms. The Ethernet domain
contains Login Servers and desktop computers for
small-scale computation and office services required by the
largest user group (plus the control channel for
location-brokering). It is the domain for older UNIX systems, like
the HP 8x5 series, which has reduced research computing
value but supports Task Broker and is supported by a
strong maintenance organization. The LocalTalk domain
supports small-scale functions using Apple Macintosh
computers and printers.

Common  social  science  research  activity  is  naturally 
suited  to  the  diagrammed  structure.  For  example,  in  the 
most  common  social  science  paradigm  for  research  data 
file  construction,  a  large  file  is  read  to  extract  a  smaller 
research  file  for  subsequent  processing  by  a  variety  of 
statistical  software.  In  the  SSPPCC  environment,  source 
file  sizes  are  typically  100-1,000  MB  and  extracted  file 
sizes  are  typically  10-100  MB.  This  paradigm  suggests  a 
distributed  computing  scheme  in  which  extraction  jobs 
are  located  in  a  hardware  and  software  domain  specialized 
for  large  data  base  handling  and  statistical  analyses  are 
located  in  a  domain  specialized  for  research  computation. 
Specifically,  extraction  jobs  should  be  located  in  a 
domain  with  fast  network  links  and  large,  fast  disks,  like 
the  HPPI  domain,  while  statistical  analyses  should  be 
located  in  a  domain  with  lower  emphasis  on  large  file 
handling  and  greater  emphasis  on  multi-tasked  CPU 
performance  and  broad  software  support,  like  the  FDDI 
domain. 

As  stated  above,  the  primary  principle  for  configuring  the 
layered  network  specifies  that  a  file  should  be  located  in  a 
domain  where  services  are  specialized  for  efficient 
handling  of  the  file  size.  The  assignment  of  file  sizes  to 
domains is roughly calibrated by requiring that job
relocation  by  location-brokering  minimally  impact  job 
throughput  in  the  domain  of  Project  Servers  where  the 
bulk  of  research  computing  is  done.  As  already  explained, 
relocating  a  job  to  a  machine  remote  from  the  disks 
holding  its  data  introduces  network  communications 
overhead into job I/O. By specifying a data communications
technology for the Project Server domain whose
effective  data  transfer  rate  dominates  that  from  disk-to- 
host,  communications  overhead  will  not  slow  job 
processing  because  the  slowest  I/O  transfer  will  then 
occur  in  the  disk-to-host  step.  The  calibration  emerges 
from  this  rationale. 

Project  Servers  perform  statistical  analyses  on  data  files 
typically  sized  from  10  to  100  MB.  These  workstations 
are  equipped  with  a  variety  of  SCSI  I/O  interfaces 
connected  to  a  variety  of  data  disks.  SCSI-2  performance 
is typical of current disk equipment. The effective
transfer  rate  of  SCSI-2  implementation  is  about  0.7  MB/ 
sec.  Effective  Ethernet  rates  are  about  0.2  MB/sec  while 
effective  FDDI  rates  are  about  2  MB/sec.  Hence,  FDDI 
rates  should  dominate  Project  Server  disk  rates  even 
when the FDDI net is under load. Thus, FDDI technology
is specified for file-sharing between Project Servers,
so Figure 1 shows the Project Servers in the FDDI
domain. Since the typical research data file size is
10-100 MB, we conclude that files larger than 100 MB
should be located in the HPPI domain, the FDDI domain
should hold files in the 10-100 MB range, and the
Ethernet domain should hold files smaller than 10 MB.
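
The calibration argument can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the rates and file-size ranges are those quoted above, while the function names and the handling of files of exactly 10 MB or 100 MB are assumptions.

```python
# Effective transfer rates quoted in the text, in MB/sec.
DISK_RATE = 0.7                                  # SCSI-2 disk-to-host
NETWORK_RATE = {"Ethernet": 0.2, "FDDI": 2.0}


def remote_io_rate(network):
    """Rate seen by a job reading a remote disk: the slower of the two steps."""
    return min(DISK_RATE, NETWORK_RATE[network])


def network_slows_job(network):
    """True when the network, not the disk, is the slowest I/O step."""
    return NETWORK_RATE[network] < DISK_RATE


def domain_for_file(size_mb):
    """Assign a file to a domain by size (boundary handling is an assumption)."""
    if size_mb > 100:
        return "HPPI"
    if size_mb >= 10:
        return "FDDI"
    return "Ethernet"


if __name__ == "__main__":
    for net in NETWORK_RATE:
        print(net, remote_io_rate(net), network_slows_job(net))
    for size in (2, 50, 400):
        print(size, "MB ->", domain_for_file(size))
```

With these figures a relocated job on FDDI still runs at the disk rate, whereas on Ethernet the network becomes the slowest step, which is the reason Project Servers are placed in the FDDI domain.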

It  should  be  clear  now  that  SSPPCC's  distributed 
computing  design  does  not  view  the  workstation  systems 
and network links as an assembly of homogeneous
hardware  and  software  capabilities  and  functions. 
Perhaps  a  more  descriptive  term  for  the  design  specifica- 
tions is  partitioned  distributed  computing.  To  understand 
how  location-brokering  distributes  job  execution  into  the 
partitions  defined  by  file-size  constraints,  some  details  of 
the  job  relocation  strategy  are  needed. 

Referring to Figure 1, all Project Servers and Login
Servers  run  location-brokering  software  called  Task 
Broker,  Hewlett-Packard  software  implemented  for  HP 
and  Sun  workstations.  Local  systems  staff  configures  a 
Task  Broker  daemon  running  on  each  machine  to  broker 
the  execution  site  for  selected  UNIX  commands.  When  a 
user  submits  a  brokered  command,  the  submittal  site, 
called  the  client  (a  Login  Server,  for  example),  broad- 
casts the  command,  its  parameters,  and  the  user  ID  to  the 
community  of  Task  Broker  servers  connected  to  the  Task 
Broker  Arbitration  Channel  in  the  Ethernet  Domain  (see 
Figure  1).  Each  server  independently  evaluates  its  current 
load,  the  characteristics  of  the  command,  and  the  user  ID, 
summarizing  its  willingness  to  accept  execution  of  the 
command  by  reporting  an  affinity  integer  in  the  range 
0-999 back to the client. Affinity calculations are programmed
by systems staff specific to the server and the
command.  The  client  (which  might  also  be  a  server) 
awards  job  execution  to  a  server  reporting  the  highest 
non-zero  affinity.  Remote  file  systems  needed  during 
execution and unavailable to the server are then NFS
auto-mounted and the job is dispatched by the daemon's
invocation  of  the  Task  Broker  service  script  on  the  server. 
When  the  server  completes  the  job,  the  service  script 
automatically  returns  products  to  the  client,  subject  to  the 
file-size  constraints  of  the  client's  domain.  The  client 
notifies  the  user  of  job  completion  by  a  screen  display  if 
the  user's  submittal  login  is  still  active,  and  by  electronic 
mail  if  it  is  not  active.  The  user  might  never  know  which 
workstation  actually  executed  the  job.  The  location- 
brokering  process  is  thus  a  client  auction  of  job  execution 
to  the  servers. 

Reporting  zero  affinity  constitutes  a  server's  refusal  to 
accept assignment. Affinity programs generate zero if the
brokered command software is not installed on the server,
if the server is not configured to run the command as
parameterized, if the server is too busy to accept the job,
or if the user is restricted from using the server. If all
responding servers bid zero, the job is queued on the
client for a fixed period (or until a server notifies the
client of another brokered job's termination), whence the
client  opens  another  auction. 
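
The client side of the auction can be sketched in Python as follows. This is a minimal illustration, not Task Broker code: the function name, the server names in the usage note, and the tie-breaking behavior are assumptions, and the real product's interfaces are not shown in this article.

```python
def run_auction(bids):
    """Return the winning server name, or None if every server bid zero.

    bids maps a server name to the affinity integer (0-999) it reported.
    """
    for server, affinity in bids.items():
        if not 0 <= affinity <= 999:
            raise ValueError(f"affinity out of range from {server}: {affinity}")
    # A zero bid is a refusal, so only non-zero bidders compete.
    willing = {s: a for s, a in bids.items() if a > 0}
    if not willing:
        # The client would queue the job and open another auction later.
        return None
    # Ties go to the first such bidder in dict order (an assumption; the text
    # only says the award goes to a server reporting the highest affinity).
    return max(willing, key=willing.get)
```

For example, `run_auction({"hp730a": 850, "sparc2": 600, "hp8x5": 0})` awards the job to the hypothetical server `hp730a`, while a dictionary of all-zero bids returns `None`.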

Given  this  job-relocation  strategy,  significant  characteris- 
tics of  the  distributed  computing  operation  are  determined 
by  affinity  calculations.  The  SSPPCC  design  specifies 
that  affinity  calculations  must  follow  certain  rules.  The 
important  rules  are  discussed  in  the  following  paragraphs. 

When  a  project  registers  with  SSPPCC  for  computing 
services, it is assigned a Project Server based on information
gathered in an interview of project staff. In SSPPCC
parlance,  if  a  research  project's  permanent  data  files  are 
located  on  disks  connected  to  machine  M,  the  project  is 
local  to  machine  M.  The  primary  rule  of  affinity  calcula- 
tion specifies  that  a  server  must  calculate  an  affinity 
biased  upward  for  a  command  submitted  by  a  local 
project;  i.e.,  if  the  command  parameters  indicate  that 
required  data  reside  on  local  disks,  a  server  adjusts  its 
affinity  upward,  subject  to  its  ambient  load  conditions. 
Because  of  this  rule,  there  is  a  systematic  bias  to  locate 
job  execution  where  the  project  is  located,  and  thus  a 
systematic  bias  against  saturating  the  network. 

There  are  two  classes  of  Project  Servers,  called  Primary 
Servers and Spill Servers. The superior performance of HP
700 series workstations compared to Sun SPARC Stations
makes them the preferred sites for project locations, so HP
700  series  machines  are  Primary  Servers.  Primary  Servers 
adjust  their  affinity  for  a  non-local  project's  command 
downward;  hence  they  are  biased  to  prefer  local  project 
commands  and  to  disdain  requests  for  non-local  project 
service.  Spill  Servers  calculate  affinities  between  the 
preference  and  disdain  ranges  of  the  Primary  Servers. 
Thus,  as  a  very  active  project  crowds  its  local  Primary 
Server with excess load, the load migrates first onto the
Spill Servers before crowding other project activity on
Primary Servers. Sun SPARC Stations are specified as
the  Spill  Servers  because  Spill  Servers  must  ideally 
accept  job  load  originating  from  any  site  in  the  domain. 
Since Sun workstations are the most active and economical
software development platform, all software used in
the SSPPCC domain is implemented on them at the
lowest cost.

To  preserve  the  file-size  domains  described  above,  a 
server  will  bid  zero  for  any  command  requiring  a  remote 
file  larger  than  the  range  assigned  to  its  domain.  Hence,  a 
Login  Server  in  the  Ethernet  domain  will  bid  zero  for  any 
command  requiring  a  remote  file  larger  than  10  MB.  A 
Project  Server  will  bid  zero  for  any  command  requiring  a 
remote  file  larger  than  100  MB.  Thus,  projects  with 
permanent  files  larger  than  100  MB  must  be  located  on 
the  File  Servers  in  the  HPPI  domain. 
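
The affinity rules described in this and the preceding paragraphs can be drawn together in a short Python sketch. Everything numeric here is hypothetical except the 0-999 range and the 10 MB and 100 MB limits: the article gives no affinity values, so the three bands, the "too busy" threshold, and the linear load scaling are assumptions chosen only to respect the stated ordering (local preference above Spill Server bids above Primary Server disdain).

```python
from dataclasses import dataclass

# Largest remote file (MB) a server will accept, by domain (from the text).
# The HPPI domain has no upper limit.
REMOTE_FILE_LIMIT_MB = {"Ethernet": 10, "FDDI": 100}

# Hypothetical affinity bands (low, high); the source gives no numbers.
PREFER = (700, 999)    # any server, command from a local project
SPILL = (400, 699)     # Spill Server, non-local project
DISDAIN = (1, 399)     # Primary Server (or other), non-local project
BUSY_LOAD = 0.9        # hypothetical "too busy" threshold


@dataclass
class Server:
    name: str
    domain: str                 # "Ethernet", "FDDI" or "HPPI"
    is_spill: bool
    local_projects: frozenset   # projects whose permanent files are on local disks
    installed: frozenset        # brokered commands installed on this server
    load: float = 0.0           # 0.0 idle ... 1.0 saturated


def affinity(server, command, project, file_mb, user_allowed=True):
    """Return an affinity in 0-999; zero is a refusal."""
    if command not in server.installed or not user_allowed:
        return 0
    if server.load >= BUSY_LOAD:
        return 0
    is_local = project in server.local_projects
    limit = REMOTE_FILE_LIMIT_MB.get(server.domain)
    if not is_local and limit is not None and file_mb > limit:
        return 0                # preserve the file-size domains
    if is_local:
        low, high = PREFER
    elif server.is_spill:
        low, high = SPILL
    else:
        low, high = DISDAIN
    # Scale within the band by ambient load: idle bids high, loaded bids low.
    return int(high - (high - low) * server.load / BUSY_LOAD)
```

Because the bands do not overlap in this sketch, a busy project spills onto a Spill Server only once its local Primary Server refuses the job; overlapping bands would make the migration more gradual.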

The  restriction  of  large  file  extractions  to  the  HPPI 
domain  emerges  naturally  in  this  structure.  A  user 
submitting  a  job  from  anywhere  in  the  network  to  extract 
data  from  a  source  file  larger  than  100  MB  must  refer- 
ence the  source  file  in  his  command  syntax.  All  servers 
below  the  HPPI  domain  will  bid  zero  for  the  job.  Only 
the  File  Servers  will  participate  in  the  auction  and  the  ex- 
traction will  be  awarded  to  a  File  Server  currently  suited 
for  the  load,  with  preference  for  the  Server  where  the 
source  file  resides.  File  Servers  are  equipped  with  very 
high-performance  disks  for  the  huge  I/O  tasks  assigned  to 
the  HPPI  domain.  Since  HPPI  channel  transfer  rates 
dominate  even  these  disk  transfer  rates,  job  location  on  a 
File  Server  remote  from  the  data  will  not  impact  job 
throughput.  Thus,  HPPI  domain  job  re-location  operates 
as  in  the  other  domains  and  job  execution  suited  for  the 
HPPI  domain  is  automatically  located  there.  This  ex- 
ample is  canonical  for  partitioned  distributed  computing. 

Importantly,  Task  Broker  is  not  implemented  for  IBM 
RS/6000's  specified  as  File  Servers  in  the  HPPI  domain. 
This  problem  is  circumvented  by  designating  an  HP  750 
as  Task  Broker  Proxy  for  the  File  Servers  (see  Figure  1). 
In  an  auction  run  by  a  client  outside  the  HPPI  domain, 
this proxy machine calculates its own bid and directly
invokes  affinity  calculations  on  each  File  Server  via  the 
token  ring  control  channel  using  standard  UNIX  RPC 
facilities.  The  File  Servers  report  their  affinities  back  to 
their  proxy.  The  proxy  then  reports  the  highest  bid  to  the 
client.  If  the  client  awards  job  execution  to  the  proxy  and 
the  high  bidder  was  a  File  Server,  the  job  is  dispatched 
on  the  File  Server  by  the  proxy,  again  using  RPC  over 
the  token  ring  channel.  Conversely,  commands  directly 
submitted to a File Server, so that the File Server is the
Task Broker client, are brokered simply by the File
Server's RPC execution of the same command on the
proxy machine. The HP 750 functions as proxy client in
this case, auctioning the job as if the user had typed the
command directly to it.
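
The proxy's role in an auction run by an outside client can be illustrated with a small Python sketch. It models only the bookkeeping (which bid is reported and where the job finally runs); the RPC transport is omitted, and the function and machine names are assumptions made for illustration.

```python
def proxy_bid(own_bid, file_server_bids):
    """Return (bid reported to the client, machine that would run the job)."""
    candidates = {"proxy": own_bid}
    candidates.update(file_server_bids)   # affinities gathered from the File Servers
    executor = max(candidates, key=candidates.get)
    return candidates[executor], executor


def dispatch(awarded_to_proxy, executor):
    """Decide where the job runs once the client has made its award."""
    if not awarded_to_proxy:
        return None
    # "proxy" means the HP 750 runs the job itself; any other name means the
    # proxy forwards the job to that File Server.
    return executor
```

For instance, `proxy_bid(300, {"rs6000a": 800, "rs6000b": 0})` reports a bid of 800 to the client and remembers that the hypothetical File Server `rs6000a` earned it.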

It  should  be  clear  that  using  Task  Broker  to  make  sophis- 
ticated judgments  about  job  placement  on  a  per-job  basis 
requires  significant  program  development  by  local  system 
staff.  Task  Broker  as  shipped  is  primarily  the  skeleton  for 
implementing  distributed  computing  strategies. 

By  the  above  strategies,  jobs  submitted  to  any  UNIX 
machine  in  any  domain  are  automatically  located  on 
workstations  suited  to  the  loads.  Each  machine  loses  its 
stand-alone  identity  and  contributes  its  CPU  and  I/O 
resources  as  subsystems  to  a  common  pool  of  computing 
services.  Such  a  distributed  computing  structure  is  called 
a  multicomputer.  In  SSPPCC's  Task  Broker  implementa- 
tion, the  multicomputer  network  is  guarded  against 
saturation,  local  projects  are  shielded  from  remote 
extravagant  projects,  and  job  load  is  located  in  domains 
specialized  for  job  characteristics. 

The  File  Server  Control  Channel  shown  in  Figure  1  is  not 
just  a  Task  Broker  arbitration  channel  for  communication 
with  the  Proxy.  Another  primary  function  of  this  channel 
involves  the  HPPI  domain's  Storage  Server  which 
maintains  a  large  data  library  accessed  via  the  network  by 
all  workstations  acting  as  Storage  Server  clients.  In 
particular, File Servers access very large data files located
in the library via HPPI links. HPPI is a point-to-point
technology,  so  for  each  HPPI  packet  transferred  from  the 
Storage  Server  to  a  File  Server,  the  single  HPPI  interface 
of  the  Storage  Server  must  be  committed  to  a  client 
connection  established  through  the  Network  Systems 
Corp.  HPPI  switch.  In  order  to  reserve  the  Storage 
Server's  single  HPPI  data  channel  for  block  transfers, 
each  File  Server  client  must  communicate  its  request  for 
the  next  data  block  to  the  Storage  Server.  The  Server 
queues  these  requests  and  ships  the  blocks  as  they  arrive 
from  its  peripheral  devices.  The  HPPI  channel  is  reserved 
by  requests  on  the  File  Server  Control  Channel.  It  also 
serves  as  the  file  migration  channel  for  the  Ethernet 
domain  via  the  Cisco  router.  (The  Storage  Server  and  file 
migration  services  are  described  in  detail  below.) 

The  description  of  Figure  1  is  completed  by  remarking 
that  networked  desktop  PC  and  Macintosh  computers 
typically  access  the  multicomputer  via  the  Login  Server. 
Since  a  user  might  never  know  which  workstation 
actually executed his job, a Login Server appears to the
user  as  a  machine  capable  of  running  every  installed 
program  on  files  of  arbitrary  size  with  robust  throughput 
under  load.  In  other  words,  the  Login  Server  appears  as  a 
powerful  mainframe.  But  a  multicomputer  enjoys  distinct 
and  profound  advantages  over  a  mainframe: 


1. It is dramatically cheaper, costing less than
10% of mainframes with similar capacity.

2.  It  is  out-of-service  only  in  the  event  of 
central  power  failure.  The  redundancy  of  the  independent 
systems  is  a  virtual  guarantee  against  total  system 
downage. 

3. It is indefinitely extendible to accommodate
increased load at relatively low cost. Adding new
machines (or new sub-nets in any of the technologies) is
a simple network expansion. New machines are purchased
at workstation pricing.

4.  It  does  not  obsolesce  in  the  usual  sense. 
New  nodes  at  the  current  level  of  technology  can  be 
continuously  added.  Older  nodes  are  not  necessarily 
replaced;  they  participate  in  the  client  auction  with 
diminished  bids.  Furthermore,  new  nodes  are  dramati- 
cally lower-priced  than  large  platform  upgrades.  Up- 
grades can  be  small  specialized  increments  tightly 
correlated  with  demand.  Institutional  budget  planning  for 
multi-million  dollar  upgrades  every  5  to  10  years  is  never 
necessary. 

5.  There  is  no  hardware  vendor  lock-in. 
Hardware upgrade is not constrained by a single vendor's
upgrade  calendar.  New  nodes  can  be  acquired  from  any 
vendor  able  to  satisfy  a  specification  of  formal  system 
requirements.  To  connect  to  the  multicomputer,  a  work- 
station must  run  required  networking  services  (ARPA- 
Berkeley,  NFS,  and  AFS).  All  major  vendors  satisfy 
these  requirements.  Lack  of  location-brokering  software 
is  circumvented  by  use  of  the  proxy  strategy  described 
above. 

Indeed,  the  popularization  of  workstations  and  the 
resulting  potential  for  propagation  of  multicomputers  has 
profound  implications  for  the  future  of  mainframes. 

Because  the  multicomputer  is  a  shared  centrally-sup- 
ported instrument  like  a  mainframe,  it  also  has  advan- 
tages over  decentralized  solutions.  Arguments  for  decen- 
tralization are  constrained  by  lower-powered  hardware, 
inequities  in  the  availability  of  services  to  a  large  group, 
and  the  redundancy  of  relegating  to  local  sites  all  the 
burdens  of  staff  expertise,  system  maintenance  and 
upgrade,  planning,  and  software  acquisition  and  support. 
As  administered  by  SSPPCC,  the  multicomputer  avoids 
all  these  evils. 

Distributed  File  Systems 

In  networked  file-sharing,  a  directory  structure  residing 
on a disk connected to a remote machine (the file-sharing
server) is made available to a local machine (the file-
sharing client) as if it resided on a disk directly connected
to the client. NFS and AFS are the file-sharing
services specified by the SSPPCC design. The two
facilities are not equivalent. They differ markedly in their
strategies for record-level I/O, their strategies for file
security and data integrity, and their system management
facilities. Additionally, today NFS is a facility bundled
with every major vendor's UNIX software so any machine
can be an NFS server or client. In contrast, the AFS
server  facility  must  be  separately  purchased  from 
Transarc  for  each  server,  hence  only  designated  machines 
can  be  AFS  servers  today.  Because  AFS  4.0  is  the 
Distributed  File  System  of  the  OSF  DCE  which  enjoys 
wide  vendor  acceptance,  the  AFS  server  facility  might 
one  day  be  as  ubiquitous  as  NFS  without  additional  cost. 

In  the  SSPPCC  design,  at  least  one  AFS  database  server 
is  specified  in  each  of  the  Ethernet,  FDDI,  and  HPPI 
domains. All workstations in the multicomputer are AFS
clients.  All  AFS  database  servers  are  referenced  by  all 
AFS  clients,  hence  files  in  every  AFS  volume  are  avail- 
able to  all  machines  in  the  multicomputer,  subject  to 
ordinary  data  security  protections.  Except  as  noted  in  the 
next  paragraph,  AFS  database  servers  ordinarily  hold 
only  relatively  fixed  files  such  as  software  libraries  and 
social  science  data  base  archives.  Users  can  request 
private  AFS  volumes  with  specified  quota  to  hold  user 
files  accessible  by  any  Task  Broker  server. 

Location-brokering invokes NFS services whenever an
existing file is otherwise unavailable to a Task Broker
server. New files created by a re-located job are never
allocated in an NFS-mounted filesystem during job
execution because writes by an NFS-client process
addressed to a remote filesystem are slow: they must
complete on the NFS-server disk before I/O completion is
signalled to the NFS-client process. Instead, new files are
created in scratch disk space directly attached to the Task
Broker server and returned to the Task Broker client at
job termination, subject to file-size constraints. If the new
file is too large for the domain of the Task Broker client,
the file is copied to the AFS server in the proper multicomputer
domain with full user access privileges. (Unlike
NFS, writes by an AFS-client to a remote AFS filesystem
are cached on a local AFS-client disk so that I/O wait-
time  is  significantly  reduced.)  The  user  is  notified  of  the 
file's  AFS  path  name  during  the  job-completion  notifica- 
tion sequence  described  in  the  previous  section.  The  AFS 
file  will  be  automatically  scratched  120  hours  after 
creation,  so  the  user  must  execute  an  administrative 
action  to  save  the  file. 
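
The rule for returning job products can be sketched in Python as below. The size limits and the 120-hour lifetime come from the text; the function name, the destination strings, and the choice of AFS domain by file size are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Largest file (MB) a client domain may receive; the HPPI domain has no limit.
CLIENT_LIMIT_MB = {"Ethernet": 10, "FDDI": 100}
AFS_LIFETIME = timedelta(hours=120)


def place_output(size_mb, client_domain, created):
    """Return (destination, expiry) for a file produced by a relocated job."""
    limit = CLIENT_LIMIT_MB.get(client_domain)
    if limit is None or size_mb <= limit:
        return "client", None                  # returned directly, no expiry
    # Too large for the client's domain: park it on an AFS server in the
    # domain suited to its size and schedule the automatic scratch.
    afs_domain = "HPPI" if size_mb > 100 else "FDDI"
    return f"AFS server ({afs_domain} domain)", created + AFS_LIFETIME
```

A 50 MB product of a job submitted from the Ethernet domain would thus be parked on an AFS server in the FDDI domain and scratched five days after creation unless the user saves it.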

Importantly, scratch disk space on Task Broker servers is
supported  by  the  central  file  migration  service  described 
in  the  next  section.  Essentially,  file  migration  service 
provided  by  the  Storage  Server  in  the  HPPI  domain 
ensures  that  scratch  disk  space  cannot  fill,  subject  to 
configuration limits on the Storage Server. Hence, the
user is protected against early job termination when new
file  creation  at  the  Task  Broker  server  exceeds  scratch 
disk  capacity. 

Hierarchical  Storage  Schemes  and  File  Migration 
Service 

The  Storage  Server  shown  at  the  top  of  Figure  1  is  a 
formal  MSSRM  service  that  separates  high-level  UNIX 
and  network  filesystem  facilities  from  the  physical 
structure  of  storage  devices.  The  SSPPCC  design  specifi- 
cation for  MSSRM  storage  services  is  a  client-server 
model  for  data  management  in  which  the  Storage  Server 
maintains a very large library online, responding over
network links to requests for sequential data access by
workstation clients. Importantly, client access to a library
file is managed by communications software on both the
server and the client so that details of the client-server
interaction are hidden from a program running on the
client. In effect, the entire Storage Server system and
communication  link  appear  to  each  client  program  as  a 
very  large  disk  connected  to  the  client  workstation.  This 
structural  transparency  means  that  commonly  used 
commercial  software,  like  SAS  and  SPSS,  can  access 
files  in  the  library  with  no  change  in  existing  software  or 
user  procedures. 

Client  file-level  and  record-level  access  to  the  library 
must  be  fast  because  there  is  competing  demand  for 
server  attention  by  several  clients  and  the  library  files  can 
be  very  large.  For  example,  large  surveys  are  often  read 
several times for data extraction until subsequent preliminary
analyses satisfy the research group that sufficient
data  for  the  research  inquiry  is  obtained.  But  the  library 
is  too  large  to  maintain  on  a  collection  of  fast  disks  con- 
nected to  the  server.  The  capacity  for  maintaining  the 
entire  library  online  for  sequential  access  is  available 
only  in  slow-speed  tape  robots  or  optical  juke  boxes. 
This  conflict  between  speed  and  size  is  reconciled  in 
operation  by  an  internal  Storage  Server  structure  called 
hierarchical  storage.  In  a  hierarchical  storage  scheme, 
the  Storage  Server  is  equipped  with  a  slow-speed 
hierarchy,  say,  a  tape  robot  large  enough  to  hold  the 
entire  library,  plus  a  high-speed  hierarchy,  a  collection  of 
very  fast  disks  with  capacity  about  10%  of  the  tape 
system  (depending  on  summed  sizes  of  files  in  common- 
use).  The  disks  act  as  a  cache  for  files  requested  from  the 
tape  library,  as  follows.  When  a  client  request  is  re- 
ceived, the  server  checks  its  disk  cache  to  determine  if 
the  requested  file  is  already  resident  there.  If  so,  the  file 
is  shipped  to  the  client  immediately.  If  not,  the  server 
reads  the  file  from  tape  and  simultaneously  writes  each 
record to its disk cache and to the client. Subsequent
requests for that file will then be found already in the
cache. In effect, the cache is a small "window" to the
large library through which, at any one time, clients can
view the currently active files at high speed. To maintain
available space in the relatively small cache, in idle
moments  the  storage  server  executes  an  aging  algorithm 
on  cache  contents  that  is  parameterized  by  system  staff 
for  sensitivity  to  the  identity  of  cached  files  and  the  time 
elapsed  since  last  reference.  If  a  file  is  designated  for 
removal  from  the  cache,  it  is  simply  erased  if  it  has  not 
been  changed,  or  rewritten  to  tape  in  place  of  the  original 
copy  if  it  has  changed.  Efficient  tape  usage  is  automati- 
cally administered  by  the  server. 
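
The hierarchical storage behavior described above (read-through caching, aging by time since last reference, and write-back of changed files) can be modeled in a few lines of Python. This is a toy model under stated assumptions, not UniTree or RMS code: files are held whole in dictionaries, time is a plain number of seconds, and the `pinned` set stands in for the staff-configured sensitivity to file identity.

```python
class StorageServer:
    def __init__(self, tape_library, max_idle, pinned=()):
        self.tape = dict(tape_library)   # slow-speed hierarchy: name -> data
        self.cache = {}                  # fast disks: name -> [data, last_ref, changed]
        self.max_idle = max_idle         # seconds a file may sit unreferenced
        self.pinned = set(pinned)        # files never aged out (assumption)
        self.tape_reads = 0

    def read(self, name, now):
        entry = self.cache.get(name)
        if entry is None:
            # Cache miss: read from tape and keep a copy on disk.
            self.tape_reads += 1
            entry = [self.tape[name], now, False]
            self.cache[name] = entry
        entry[1] = now                   # record the reference time
        return entry[0]

    def write(self, name, data, now):
        # Changed files live in the cache until aged out.
        self.cache[name] = [data, now, True]

    def age(self, now):
        # Run in idle moments: evict files unreferenced for too long.
        for name in list(self.cache):
            data, last_ref, changed = self.cache[name]
            if name in self.pinned or now - last_ref <= self.max_idle:
                continue
            if changed:
                self.tape[name] = data   # rewrite in place of the original copy
            del self.cache[name]
```

In this model a second read of the same file is served from the cache without touching tape, and a changed file reaches the tape library only when the aging pass removes it from the cache.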

A  storage  server  is  typically  programmed  to  provide 
other  essential  support  functions,  such  as  network-wide 
disk backup to the slow-speed hierarchy and online
library  administration  and  documentation. 

As  mentioned  in  the  Overview,  Storage  Server  software 
is  available  today  from  ACSC  (UniTree  running  on  an 
RS/6000  under  AIX  3.2)  and  Epoch  (RMS  running  on  a 
SPARC  Station  under  SunOS  4.1.2).  The  products 
provide  essentially  the  same  hierarchical  storage  manage- 
ment, but  UniTree's  Central  File  Manager  (CFM)  only 
operates  as  a  single  NFS  file-server  while  RMS  operates 
as  a  networked  disk  analogue  that  can  support  multiple 
independent  NFS  file-servers  (by  file  migration  as 
described  below). 

To  configure  the  Storage  Server,  one  must  specify  the 
workstation  base  system,  the  disk  cache,  and  the  large- 
capacity library unit. As an expensive look-ahead example,
a UniTree Storage Server realized in early 1993
by an IBM RS/6000 Model 560 with 128 MB RAM and
equipped with an IGM 270 GB Exabyte tape carousel, a
dual-ported 27 GB Seagate Elite-3 enhanced IPI-2 disk
cache (eight 3.38 GB IPI disks), plus IPI and HPPI
interfaces for the RS/6000 microchannel bus will deliver
cached data from the 270 GB library to clients at effective
rates in excess of 3 MB/sec (1 GB in 5 minutes), about
five times faster than current SCSI-2 effective rates. (The
Elite-3 disk and HPPI interface will not be announced
until Fall, 1992. Such an assembly with 2 HPPI-equipped
clients would cost an estimated $350,000 at academic
pricing.) More conservatively, in July, 1992, UniTree
driving the same workstation and tape carousel but with a
15 GB Elite-2 IPI cache and FDDI interfaces instead of
HPPI could deliver cached data to clients at about half
the speed for half the price. If data is not in the cache
when requested, the tape unit delivers data at roughly
nine-track  tape  rates. 
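
The delivery rates quoted above can be related to one another with simple arithmetic. The Python sketch below assumes 1 GB = 1,000 MB and uses the 0.7 MB/sec SCSI-2 figure given earlier; the function names are illustrative.

```python
def minutes_to_transfer(size_mb, rate_mb_per_sec):
    """Minutes needed to move size_mb at a sustained effective rate."""
    return size_mb / rate_mb_per_sec / 60.0


def rate_for(size_mb, minutes):
    """Effective rate (MB/sec) implied by moving size_mb in the given minutes."""
    return size_mb / (minutes * 60.0)


if __name__ == "__main__":
    print(minutes_to_transfer(1000, 3.0))   # 1 GB at exactly 3 MB/sec
    print(minutes_to_transfer(1000, 0.7))   # 1 GB at the SCSI-2 effective rate
    print(rate_for(1000, 5))                # rate implied by "1 GB in 5 minutes"
    print(rate_for(1000, 5) / 0.7)          # ratio to the SCSI-2 effective rate
```

Under these assumptions, 1 GB in 5 minutes corresponds to roughly 3.3 MB/sec, which is a little under five times the SCSI-2 effective rate and consistent with "in excess of 3 MB/sec".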

The  speed  of  the  disk  cache  is  an  important  determinant 
of  Storage  Server  data  delivery  rate.  RAID,  RAM-disk, 
and  SCSI  fast-and-wide  technologies  are  all  faster 
alternatives  to  the  IPI  disk  cache  specified  above.  RAID 
and  RAM-disk  are  significantly  more  expensive  per 
megabyte and SCSI fast-and-wide availability is indeterminate
at this time, hence SSPPCC prefers IPI technology
today.

Importantly,  both  ACSC  and  Epoch  software  offer  file 
migration  service.  Essentially,  file  migration  extends  the 
Storage  Server's  disk  cache  management  capabilities  to 
every  disk  in  the  multicomputer.  For  each  filesystem  on  a 
disk  attached  to  any  workstation,  a  logical  file  space 
local to the workstation is defined larger than the physical
filesystem space. The total file space is actually
available in the slow-speed hierarchy at the Storage
Server. The local filesystem is managed as a window of
active files on the larger logical file space, exactly
analogous  to  the  Storage  Server's  primary  disk  cache 
management.  File  migration  is  available  for  UniTree  as  a 
separate  facility  called  the  Distributed  File  System 
Manager  (DFSM).  In  direct  contrast,  file  migration  is 
intrinsic  to  Epoch's  RMS. 
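
The idea of a local filesystem acting as a window on a larger logical file space can be sketched as follows. This is a simplified model, not DFSM or RMS behavior: the least-recently-referenced eviction policy, the class name, and the numeric timestamps are assumptions, and recall of migrated files is omitted.

```python
class MigratingFilesystem:
    def __init__(self, physical_mb):
        self.physical_mb = physical_mb
        self.resident = {}    # name -> (size_mb, last_ref), on the local disk
        self.migrated = {}    # name -> size_mb, held at the Storage Server

    def used_mb(self):
        return sum(size for size, _ in self.resident.values())

    def logical_mb(self):
        # Logical file space: local files plus those migrated away.
        return self.used_mb() + sum(self.migrated.values())

    def store(self, name, size_mb, now):
        if size_mb > self.physical_mb:
            raise ValueError("file larger than the physical filesystem")
        # Replacing a file discards any earlier copy first.
        self.resident.pop(name, None)
        self.migrated.pop(name, None)
        # Migrate the least recently referenced files until the new one fits.
        while self.used_mb() + size_mb > self.physical_mb:
            oldest = min(self.resident, key=lambda n: self.resident[n][1])
            self.migrated[oldest] = self.resident.pop(oldest)[0]
        self.resident[name] = (size_mb, now)
```

A 100 MB scratch filesystem that receives files of 60, 30, and 50 MB in turn ends up holding the last two locally, with the first migrated, for a logical file space of 140 MB.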

Current  Implementation 

The  current  status  of  multicomputer  implementation  is 
diagrammed in Figure 2. SSPPCC has connected public
and private workstations below the FDDI domain with
extensive LocalTalk and Ethernet links and is in the
process of evaluating FDDI concentrators from four
vendors by inter-operability and performance criteria.
When the evaluation is complete, six workstations will
be connected by FDDI links in the summer of 1992 as
the initial realization of the FDDI domain. Figure 3
shows the planned extension of the current connectivity,
including construction of the HPPI domain loosely
scheduled for Summer, 1993. Figure 4 is a schematic
diagram of the logical structure in Figure 3 constructed
in  four  campus  buildings  in  the  SSPPCC  domain. 

The  remainder  of  this  section  presents  ongoing  consid- 
erations in  the  refinement  of  the  SSPPCC  design. 

The  current  SSPPCC  sensibility  about  the  choice  be- 
tween UniTree  and  RMS  favors  RMS  for  two  ill-defined 
reasons:  (1)  RMS  originated  in  the  workstation  industry 
whereas  UniTree  originated  in  mainframe  distributed 
computing  environments.  It  is  possible  that  RMS  is  more 
mature  as  UNIX  software — more  efficient  and  more 
stable.  (2)  In  MSSRM  terms,  UniTree  is  fundamentally 
constructed  as  a  high-level  service  (NFS  file  service) 
integrated  with  the  Storage  Server  primitive;  hence  it  is 
"too  big".  RMS  provides  the  primitive  MSSRM  Storage 
Server  function  and  hence  can  naturally  support  a 
multitude  of  MSSRM-compatible  high-level  services 
without recoding. In this sense, the design of RMS is
"cleaner".  Consistent  with  this  sensibility,  Epoch's 
literature states commitments to Open Systems strategies,
describing plans for port of the MSSRM Storage
Server from SPARC to IBM, DEC, HP, and SGI architectures.

Figure 2: Schematic of Current Social Science Network with FDDI Evaluation
[Diagram not reproduced. Labels recoverable from the diagram: FDDI,
Ethernet, and LocalTalk domains; File Servers, Primary Servers, Spill
Servers, Login Server, and Task Broker proxy; Cabletron 16 Mb/sec Token
Ring hub; DEC FDDI concentrator; Cisco AGS+ router; campus Ethernet
backbone; IBM RS/6000 Model 320 (32 MB RAM); HP 9000 Model 750 (64 MB
RAM); HP 9000/730 (32 MB RAM); Sun SPARC 2 (32 MB RAM); Sun SPARC 1
(64 MB RAM); Apple Mac II.]


Figure 3: Social Science Network with HPPI and FDDI Extensions
[Diagram not reproduced. Labels recoverable from the diagram: HPPI
domain; Storage Server; File Servers; Primary Servers; Spill Servers;
Login Server.]


Figure 4. Social Science and Public Policy Network
with Planned Extensions for Fiscal Year 1992-93
[Diagram not reproduced. Legend recoverable from the diagram: HPPI
switch, File Server, Primary Server, Spill Server, terminal, workstation,
printer; link types: Codenoll FOIRL with AUI cable, HPPI link, FDDI
link, Ethernet link; building label: 1155 Building.]


The RMS port to the IBM POWER architecture is very
important because the primary platform identified for the
HPPI domain is the IBM RS/6000 running AIX 3.2. The
IBM platform is specified for the HPPI domain today
because it is the only system projected to meet the
specialized hardware and software requirements of the HPPI
domain on a foreseeable schedule. (However, the new
DEC Alpha series departmental server might be an
alternative, pending further information from DEC.) For
required hardware, IBM and third-party suppliers are
preparing implementations of HPPI for high-speed
inter-computer data communications and IPI-2 interfaces for
high-speed disk I/O. Furthermore, the CPU speed of the
high-end POWER architecture currently rivals the HP
700 series and is thus suited to intense data-handling
tasks, particularly buffer-to-buffer moves and compression.
AIX is the only version of UNIX that allows file
definitions to span physical volume boundaries, so file
sizes are not constrained by disk sizes. AIX also
automatically maps open data files into permanent segments
of the virtual memory sub-system, so an intrinsic dynamic
RAM-caching for data is always in operation. AIX 3.2
also includes intrinsic disk-mirroring for critical file
integrity and deferred I/O for true overlap of I/O and CPU
services to a single process. For these reasons, if operational
pressures to acquire file migration service force
acquisition of RMS on SPARC architecture in the near
term, SSPPCC will pursue a subsequent technical evaluation
to determine the wisdom of converting the RMS
license to another architecture such as POWER.

The choice of IBM has one disadvantage: an
apparent lack of near-term plans for a multi-processor
POWER architecture. Multi-processing is potentially
significant in the Storage Server for parallelism of data
shipment, data compression, and data uncompression. To
explain, one of the primary limitations on effective data
transfer rate is the so-called "internal rate" of the disk,
the speed at which data bytes are transferred from the
disk surface to the disk controller's buffers. A quick
strategy for significantly increasing the effective rate is data
compression prior to writing the data, so that internal byte
transfers are minimized. Of course, data must then be
uncompressed upon reading. I/O interface designers claim
that adding available hardware logic for data compression
and uncompression onto the controller slows the transfer
rate well below data bus speeds. That explains why
hardware compression in the device controller is only
found on slow devices like tape drives. Hence, data
compression and uncompression are today ideally done
by a main processor. Since data compression logic is very
CPU-intensive, the parallelism provided by a high-performance
symmetric multi-processor architecture would
probably significantly enhance Storage Server performance.
Such UNIX workstation architectures are reported
for release by Sun in the SPARC 3 series, HP in the 700
series, and DEC in the Alpha series. Perhaps one of these
other vendors will meet the requirements of the HPPI
domain on a suitable schedule.
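The compression argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the SSPPCC design: the function names, the sample record, and the rate figures are hypothetical, zlib stands in for whatever compressor a Storage Server would use, and worker threads stand in for multiple processors.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def effective_rate(internal_rate, compression_ratio, cpu_rate):
    """Rate at which uncompressed bytes are effectively stored.

    internal_rate     -- bytes/sec moved between disk surface and controller
    compression_ratio -- uncompressed size / compressed size (>= 1.0)
    cpu_rate          -- uncompressed bytes/sec the processors can compress

    Compression multiplies the disk's internal rate, but only up to the
    rate at which the main processors can produce compressed data.
    """
    return min(internal_rate * compression_ratio, cpu_rate)


def compress_blocks(blocks, workers):
    """Compress independent blocks concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed):
    """Reverse compress_blocks: data must be uncompressed upon reading."""
    return [zlib.decompress(c) for c in compressed]


if __name__ == "__main__":
    # Hypothetical fixed-format record, repeated to imitate a flat file.
    record = b"1993,CHICAGO,IL,000123,000456\n"
    blocks = [record * 2000 for _ in range(8)]
    packed = compress_blocks(blocks, workers=4)
    ratio = sum(map(len, blocks)) / sum(map(len, packed))
    print(f"compression ratio: {ratio:.1f}")
    # Hypothetical rates: 2 MB/s internal disk rate, 5 MB/s CPU compression.
    print(f"effective rate: {effective_rate(2e6, ratio, 5e6):.0f} bytes/sec")
```

In this simple model the processor becomes the limit as soon as the compression ratio is high enough, which is the point at which additional processors would help.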

Solutions  for  relational  data  base  management  are 
weakly  researched  and  unspecified  in  the  SSPPCC 
design  at  this  time.  Major  social  science  databases  are  not 
available  as  relational  structures  and  common-use 
software  mostly  requires  rectangular  flat  files,  hence 
most  database  access  in  the  SSPPCC  environment  has 
not  used  relational  queries.  (The  R-Squared  system 
mentioned in the Overview is one exception.) Nonetheless,
relational DBMS strategies are an investigative
direction for SSPPCC to be guided by the newly formed
Faculty Database Committee. The current proposed
design  specifies  networked  relational  database  servers 
supported  by  file  migration  service.  Initial  conversations 
with  Sybase  are  promising  for  successful  integration  of 
database  server  and  file  migration  facilities,  but  essential 
operational  testing  will  await  implementation  of  the  file 
migration  server.  In  general,  SSPPCC  believes  that  wise 
planning  must  specify  a  DBMS  server  that  is  supported 
by  an  implementation  of  the  MSSRM,  but  these  concepts 
appear  to  be  weakly  considered  by  the  major  DBMS 
vendors  today. 

Integration  of  the  AFS  server  facility  with  MSSRM- 
based  systems  is  not  available  today  because  the  current 
AFS  3.2  database  server  uses  non-standard  UNIX 
filesystem  structures  on  server  volumes.  However, 
integration  with  file  migration  service  appears  well- 
considered  by  ACSC  and  Epoch  and  will  probably  await 
shipment  of  AFS  4.0  as  the  Distributed  File  System 
standard  of  the  OSF  1.0  DCE  in  1993. 

It should be clear that the SSPPCC design is a comprehensive
specification for the technical steps in implementing
distributed computing services. Perhaps its most
important feature for the social science computing
environment at the University of Chicago is the coherence
that it brings to technical planning. A large collection
of competing vendors' current and future hardware
and software offerings is organized for selection by the
criteria that emerge from the design specifications.

Wide-Area  Extension  of  the  Design 

So far, this article has described a strategy for campus-wide
large-scale computing services provided by a local
multicomputer. Extension of these services to a national
(or even international) user community is quickly
realizable because existing wide-area networking facilities
already provide the essential data communication
links and the SSPPCC design intrinsically manages
services so that the links can be used efficiently.




The  Ethernet  domain  of  the  Chicago  campus  is  connected 
to  the  Ethernet  domains  of  every  campus  on  the  Internet 
via  wide-area  links  operating  at,  say,  nominal  1  Mbit/sec. 
Suppose  a  relationship  with  a  project  on  a  remote  campus 
for  use  of  the  SSPPCC  multicomputer  is  established.  In 
the simplest case, project staff could telnet to an SSPPCC
Login Server and submit jobs. Ideally, if the remote
project's computer system is a member of an AFS cell,
file structures on SSPPCC AFS servers could be made
directly available to the staff through routine AFS
file-sharing.

More  interestingly,  suppose  a  remote  user's  workstation 
is  running  a  Task  Broker  daemon  or  is  recognized  by 
either  a  local  or  SSPPCC  Task  Broker  Proxy  Machine. 
Then SSPPCC multicomputer services could be transparently
provided to the user. For example, a remote user
might  invoke  an  extraction  via  SAS  from  a  large  data 
base  maintained  in  the  Chicago  Storage  Server  library. 
Affinity  calculations  would  automatically  locate  the  job 
in  the  SSPPCC  HPPI  domain  and,  after  SAS  execution  on 
an  SSPPCC  File  Server,  the  Task  Broker  service  script 
could  ship  the  resultant  extracted  file  back  to  the  user's 
workstation  (subject  to  the  file-size  constraints  and 
associated  procedures  described  above).  The  user  could 
prepare  the  job  as  if  it  were  to  execute  locally  and  might 
never  know  that  the  job  was  brokered  to  a  remote  site. 
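A rough calculation shows why file-size constraints matter when an extracted file is shipped back over a wide-area link. The sketch below assumes the nominal 1 Mbit/sec rate mentioned earlier and ignores protocol overhead and competing traffic; the function name and the example file sizes are hypothetical.

```python
def transfer_seconds(size_bytes, link_bits_per_sec=1_000_000):
    """Idealized time to ship a file over a link of the given nominal rate."""
    return size_bytes * 8 / link_bits_per_sec


if __name__ == "__main__":
    # Hypothetical extract sizes: 1 MB, 10 MB, and 1 GB.
    for size in (1_000_000, 10_000_000, 1_000_000_000):
        secs = transfer_seconds(size)
        print(f"{size:>13,} bytes: {secs:>8.0f} sec ({secs / 3600:.2f} hours)")
```

Under these assumptions a 10 MB extract takes about 80 seconds, while a 1 GB extract would occupy the link for more than two hours.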

Now  suppose  that  several  such  multicomputers  are 
operating  on  the  national  Internet  and  the  project  has 
established a service relationship with each of them. Then
the  user's  file  extraction  would  be  transparently  brokered 
across  the  community  of  multicomputers,  each  site 
calculating  its  affinity  based  on  current  load  and  the 
availability  of  the  requested  source  file.  (Source  files 
would  have  to  be  referenced  by  generic  names  that  are 
resolved  at  each  site  into  local  pathnames.  The  MSSRM 
Nameserver  function  is  one  strategy  for  such  name 
translation.) 
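The brokering scheme just described can be sketched as follows. This is a minimal illustration, not the Task Broker interface or the MSSRM Nameserver: the Site class, the affinity formula (one minus current load, or zero when the source file is absent), and all site names and pathnames are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One multicomputer site taking part in the brokering community."""
    name: str
    load: float                                   # 0.0 idle .. 1.0 saturated
    catalog: dict = field(default_factory=dict)   # generic name -> local path

    def resolve(self, generic_name):
        """Translate a generic source-file name into a local pathname."""
        return self.catalog.get(generic_name)

    def affinity(self, generic_name):
        """Bid for a job: zero if the file is not held, else spare capacity."""
        if self.resolve(generic_name) is None:
            return 0.0
        return 1.0 - self.load


def broker(sites, generic_name):
    """Return (site name, local path) for the best bid, or None if no bids."""
    best = max(sites, key=lambda s: s.affinity(generic_name), default=None)
    if best is None or best.affinity(generic_name) <= 0.0:
        return None
    return best.name, best.resolve(generic_name)


if __name__ == "__main__":
    sites = [
        Site("site_a", 0.7, {"census86": "/library/census/1986.dat"}),
        Site("site_b", 0.2, {"census86": "/data/c86/profile.dat"}),
        Site("site_c", 0.0, {}),
    ]
    print(broker(sites, "census86"))
```

The user names only the generic file; each site decides for itself whether it can serve the request and how strongly to bid.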

This  last  example  is  a  scheme  for  national  computerized 
database  libraries  accessed  transparently  by  widely 
distributed  projects.  Though  there  are  probably  political 
and  academic  issues,  such  libraries  can  be  established 
with  relatively  low-cost  technology  as  described  above. 
(Most  of  the  long-term  costs  would  be  required  for 
support of library staffs.) In general, once a multicomputer
is providing services to a local network, extension
of  those  services  to  a  wide-area  network  is  not  a  difficult 
technological  issue.  Such  extension  accomplishes  the 
epitome  of  distributed  computing  services. 

¹ Paper presented at the IASSIST 92 Conference held in
Madison, Wisconsin, U.S.A., May 26-29, 1992. George
Yales, Director, Social Science & Public Policy Computing
Center (SSPPCC), University of Chicago, 1155
East 60th Street, Chicago, Illinois 60637. (312) 702-0793

² For a detailed description of the IEEE Reference
Model, refer to the document entitled Mass Storage
System Reference Model, Version 4, edited by Messrs.
Sam Coleman and Steve Miller and published by the
IEEE Technical Committee on Mass Storage Systems
and Technology.




IASSIST 1995


Partners for Access: Working Together in a Changing Data Environment
L'accès aux données dans un environnement en pleine mutation : un partenariat à développer

Quebec City, Canada, May 9-12, 1995

21st Annual Conference of the International Association for Social Science
Information Service and Technology

IASSIST Conference & Workshops

The premier conference for professionals providing data services in libraries and archives, IASSIST 1995
focuses  on  the  new  opportunities  for  collaboration  presented  by  the  phenomenal  growth  of  worldwide 
computing  networks.  Taking  advantage  of  this  new  technology  presents  many  challenges,  and  encourages 
data  service  providers  to  work  together  to  ensure  continued  access  to  useful  and  high-quality  data. 

Conference  Location: 

Loews  Le  Concorde 
1225,  Place  Montcalm 
Quebec, QC G1R 4W6
Canada 

Conference  Site: 

The  perfect  site  for  a  conference,  the  Loews  Le  Concorde  in  Quebec  City  is  located  on  the  Grande  Allee,  an 
avenue  characteristic  of  old  Europe.  Across  from  the  hotel  are  the  Plains  of  Abraham,  a  distinguished  setting 
draped  in  a  curtain  of  greenery  offering  a  magnificent  view  of  the  St.  Lawrence  River.  A  brief  walk  from  the 
hotel is Old Quebec, the only walled city in North America. Within Old Quebec, winding cobblestone streets
are  lined  with  colourful  bistros  and  quaint  boutiques. 

The  conference  hotel  features  424  rooms  and  spacious  suites.  Each  room  has  two  telephone  lines  and  a 
computer/fax  connection.  Hotel  services  include  concierge  assistance,  secretarial  services,  and  health  club 
(heated pool in season). The hotel is 20 minutes from Quebec International Airport.


Hotel Rates:

Conference  hotel  rates  are  $109  Canadian  nightly  (taxes  not  included)  for  single  or  double 
occupancy.  To  receive  the  conference  rate,  mention  that  you  will  be  attending  IASSIST95.  Hotel 
Reservations must be made before April 7, 1995.

Canada     1-800-463-5256       USA        1-418-647-2222       Fax        (418)647-4710 

Conference & Workshop Rates:

                          Member     Non-Member
Conference Only           $250       $350
Workshop Only             $150       $150
Conference & Workshop     $300       $400

Non-member conference rate includes a 1-year membership in IASSIST.

A late fee of $50 will apply to conference or workshop registration received after April 9, 1995.
An additional $50 fee will be charged for in-person registration.




IASSIST Workshops


An Immersion in the World Wide Web


For those looking for an immersion opportunity in World Wide Web (WWW) technology, a Web Thread is
available by registering in Workshops #1 & #2. Combining these two workshops will provide you with a
sound introduction to information services through WWW technology and several examples of how to apply
the Web in data libraries and archives. If you have a working knowledge of the Web, you may decide to
enroll only in the Web Applications workshop (#3). Workshop #2 is reserved for those who wish to take the
Web Thread.


Morning Workshops

Workshop 1: World Wide Web Basics for Information Specialists
Workshop 3: Applying the World Wide Web in Data Libraries and Archives
Workshop 4: An Introduction to Providing Data Library Services

Afternoon Workshops

Workshop 2: Applying the World Wide Web in Data Libraries and Archives
Workshop 5: The Fundamentals of Geographic Information Systems
Workshop 6: Exploring Models of Providing Client Support for CANSIM


IASSIST Conference


Concurrent Session 1

* Delivering Data on the Net: where access meets
the pavement of the information highway

* Documentation Standards: nearing harmonization
or just humming a tune?

* Census Public Use Microdata Products: a
cross-national review

Concurrent Session 2

* Applications of Data, the Net and Services

* Enhancing Access to Data and Information:
services and tools

* Canadian Longitudinal and Recurring Surveys:
defining the services

Concurrent Session 3

* When the Archivist Hits the Information Highway:
avoiding the dusty trail

* Enriching Documentation: new sources and tools

* Barriers to Data Flow: copyright and access

Concurrent Session 4

* Cooperation: stretching the network into a
community

* Bridging Resources: dealing with non-traditional
social science data

* The Politics and Technology of Access to Public
Data

Concurrent Session 5

* Access to AIDS Research Data: do we face data
deficiency

* The Data Librarian/Archivist: is this an obsolete
profession?

* Access to GIS/Mapping and Data: methods and
applications

Conference Fee Includes:

* Attendance at all plenary and concurrent
sessions.

* Materials and handouts presented at sessions,
unless otherwise stated by conference speakers.

* Evening reception on Tuesday, May 9th at the
Conference Hotel.

* Three continental breakfasts, May 10-12th at the
Conference Hotel.

* Sugar shack dinner on Wednesday, May 10th
(Family members welcome at no additional
charge).

* Roundtable lunch on Thursday, May 11th at the
Conference Hotel.

* IASSIST Business lunch on Friday, May 12th at
the Conference Hotel.

Workshop Fee Includes:

* Attendance at one full-day or two half-day
sessions.

* Materials and handouts presented at workshop,
unless otherwise stated by workshop instructors.

* Evening reception on Tuesday, May 9th at the
Conference Hotel.

* Combined conference and workshop fees
include all of the above.




CONFERENCE  REGISTRATION  FORM 


WORKSHOP SELECTION

You may attend two half-day sessions. Please indicate (check one) a 1st and 2nd choice for a morning and
afternoon workshop. If no 2nd choice is indicated, and your 1st choice is not available, the workshop fee will be refunded.
Some workshops (noted with an "R") have a restricted enrollment because of a limited number of computers in labs.
People who register for Workshop #1 will be given preference for Workshop #2.

                                                                    Choice
MORNING HALF-DAY SESSIONS:
R  1. World Wide Web Basics for Information Specialists             □ 1st  □ 2nd
R  3. Applying the World Wide Web in Data Libraries and Archives    □ 1st  □ 2nd
   4. An Introduction to Providing Data Library Services            □ 1st  □ 2nd

AFTERNOON HALF-DAY SESSIONS:
R  2. Applying the World Wide Web in Data Libraries and Archives    □ 1st  □ 2nd
R  5. The Fundamentals of Geographic Information Systems            □ 1st  □ 2nd
   6. Exploring Models of Providing Client Support for CANSIM       □ 1st  □ 2nd

                                   Member              Non-Member
Conference Only                    □ $250              □ $350
Conference & Workshops             □ $300              □ $400
Workshop Only                      □ $150              □ $150
Late Fee (after April 9, 1995)     □ $50               □ $50
Payment at the conference          □ $50 + Late fee    □ $50 + Late fee

TOTAL AMOUNT ENCLOSED (Payable in Canadian dollars to IASSIST95)  TOTAL: $
SUGAR SHACK DINNER (Indicate number attending: ____)
WILL ATTEND ROUNDTABLE LUNCH  □ yes  □ no
WILL ATTEND BUSINESS LUNCH  □ yes  □ no
Person to contact in case of emergency: Name:                Phone:

NAME: 


TITLE: 


AFFILIATION: 


MAILING  ADDRESS: 


ELECTRONIC  MAIL  ADDRESS: 


TELEPHONE: 


FAX: 


Please return this form before April 9, 1995 to:

Gaetan Drolet
Bibliothèque
Université Laval
ATTN: IASSIST95
Ste-Foy, Quebec CANADA G1K 7P4

Fax (418) 656-3048
E-mail: gdrolet@bibl.ulaval.ca

You can find the latest developments in the program on the Conference Homepage at:
http://datalib.library.ualberta.ca/~humphrey/iassist95.html




Integration  of  "Traditional"  Data  Centre  Services  With  GIS 
Technology 


by Jeffery Moon¹

Documents Reference/Data Centre Librarian

Social  Science  Data  Centre,  Queen's  University 


Introduction 

Recent  cooperation  between  Queen's  University's 
Documents Unit² and the Queen's/IBM Geographic
Information  Systems  (GIS)  Lab  has  resulted  in  the 
creation  of  a  GIS  satellite  site  in  the  Documents  Unit. 
This  GIS  service  point  is  housed  in  the  Social  Science 
Data  Centre  and  consists  of  one  network-connected  IBM 
PC  with  related  hardware  and  software. 

The  Data  Centre,  Map  Library,  and  GIS  Lab  are  working 
together  through  the  new  satellite  site  to  meet  the  basic 
mapping  and  data  integration  needs  of  students  and 
researchers. 

An initial component of this project has been the testing
and application of Statistics Canada's new graphing/
mapping package, E-STAT. E-STAT combines access to
CANSIM time series and 1986 Canadian Census data on
CD-ROM, and permits these data to be mapped or
graphed using straightforward, menu/mouse-driven
software. Other initiatives include the installation of OS/2,
Windows, and other GIS software, and increased
network access to data/map resources.

Geographic  Information  Systems  (GIS)  are  computer 
systems  that  store  and  link  nongraphic  attributes  or 
geographically  referenced  data  with  graphic  map  features 
to  allow  a  wide  range  of  information  processing  and 
display  operations,  as  well  as  map  production,  analysis, 
and modelling.³

Readers will soon discover that the initiatives described in
this paper do not yet fully meet the rigours of this definition.
Instead, this paper will describe a cooperative
venture between the Queen's University Documents Unit
and the Queen's/IBM Geographic Information Systems
Lab which, while not yet providing "true" GIS, promises
improved service to our users. Specifically, the following
will be covered:

Background  information  on  the  Data  Centre  and 
Queen's/IBM  GIS  Lab 

Integration  of  services  in  the  Queen's/IBM  GIS 
Satellite  Site 

Future Developments


Background 

Documents  Unit  -  Social  Science  Data  Centre:  History 

The  Documents  Unit  is  part  of  the  Queen's  Library 
System,  and  consists  of  the: 

Government  Documents  Library 

Map  and  Air  Photos  Library 

Social Science Data Centre

The Data Centre was started in 1982 in response to a
growing need to consolidate the management of machine-readable
data. Its mandate was to act as a clearinghouse
for acquiring, archiving, and accessing these data.
Initially,  support  was  provided  by  library  staff  and  one 
contract  programmer.  A  Data  Librarian  was  hired  in 
1987  to  work  half-time  in  the  Data  Centre  and  half-time 
as  a  Documents  Reference  Librarian. 

Holdings 

Holdings  include  Canadian  Census  data,  survey  data 
from  Statistics  Canada  and  other  sources,  and  time-series 
data (CANSIM, IPS, CITIBASE). Until recently, the
only  mapping  data  held  by  the  Data  Centre  have  been  the 
"CARTLIB"  files  produced  by  Statistics  Canada.  The 
Data  Centre  also  has  several  full-text  sources  on  CD- 
ROM  (e.g.  Auditor  General  Annual  Reports,  CRTC 
Decisions). 

Service 

The  Data  Centre  provides  a  wide  range  of  services 
including: 

Reference

Data access/acquisition

Computing/statistical assistance

Data conversion/manipulation

Before the GIS initiative, the Data Centre had no public
computing  facilities.  Only  one  computer  was  available, 
for  use  by  the  Data  Centre  Librarian. 




Queen's/IBM  Geographic  Information  Systems 
Laboratory:  History 

The Queen's/IBM GIS Lab opened its doors in November
1990. Its mandate is to provide service and support
innovation  in:  teaching  (both  high  school  and  university), 
research,  and  community  GIS  needs. 

Hardware/Software/Networks 

Supported in large part by donations from IBM Canada
Ltd., the GIS Lab has been equipped as follows:

Computers:  10  IBM  PC  PS/2  computers 

Workstations:  1  RISC  6000  Workstation 

General  Hardware:  a  variety  of  printers,  plotters, 
digitisers 

Network:  Token  Ring  LAN  and  connected  to  Campus 
Ethernet  Backbone 

Operating System: IBM OS/2 V2.0: chosen to
enhance/permit networking, distributed access to
ARC/INFO, multitasking, and the ability to run DOS/
WINDOWS

Software:  Arc/Info,  Geo/SQL,  AutoCAD, 
Atlas*Graphics,  SPANS 

Existing  satellites:  Biology,  Education,  Health 
Sciences,  Geology  and  now  the  Documents  Unit 

Service 

To  date,  the  GIS  Lab's  major  focus  has  been  on  its 
service,  teaching,  and  community  involvement  mandates. 
Within  Queen's,  the  primary  focus  has  been  on  "high- 
end"  mapping  (large  formats,  pen-plotters...)  and  teaching 
the  theory  and  practice  of  GIS.  As  well,  in  conjunction 
with  the  Faculty  of  Education,  the  Lab  is  currently 
involved in a two-year project to develop a GIS curriculum
for Ontario's Secondary Schools.

In  the  community,  the  Lab  has  been  heavily  involved  in 
local  mapping  initiatives  and  civic  addressing.  Both  of 
these activities are related to planning for the delivery of
emergency  services  via  the  "911"  system. 

Research  uses  are  growing  as  the  user  community 
becomes aware of the flexibility of GIS in non-geographic
applications (e.g. life sciences). The lab is
working  on  several  projects  with  health-care  planning 
themes. 

Integration: 

One of the "problems" faced by the GIS Lab has been its
success. As users have become aware of the potential of
the mapping services available in the GIS Lab, the Lab
has  experienced  increased  demand,  including  rapid 
growth  in  requests  for  "low-end"  mapping  such  as  base 
maps.  Satellite  sites  created  in  the  first  "phase"  of  the 
GIS  project  were  neither  mandated  nor  equipped  to 
handle  these  questions. 

To  meet  this  demand,  the  second  "phase"  of  the  project 
designated  the  Documents  Unit  as  a  satellite  site  with  an 
eye  to  providing  basic  mapping  and  data  integration 
facilities. 

There  was  some  debate  as  to  whether  the  GIS  satellite 
site  should  be  housed  in  the  Map  Library  instead  of  the 
Data  Centre.  Given  their  close  physical  proximity,  and 
staff's willingness to work together in servicing the
satellite site, the Data Centre was chosen to take advantage
of:

ready connection to the Ethernet Backbone

existing hardware, software, and data resources

expertise with machine-readable data

Hardware/Software/Networks 

The hardware donated by IBM to equip this satellite site
included:

IBM PS/2 Model 57 SX (386, 160 Mb hard disk, 4
Mb RAM)

IBM CD-ROM player

IBM Ethernet Card

In addition, the GIS Lab has provided the use of an HP
PaintJet (colour) printer.

The  Data  Centre  is  connected  to  the  campus  Ethernet 
backbone,  which  supports  FTP  and  CUTCP.  Given  this, 
the  potential  exists  for  extensive  sharing  of  files/software 
between  the  Queen's/IBM  GIS  Lab  proper  and  the  GIS 
satellite  sites. 

As  for  software  and  data,  the  Queen's/IBM  GIS  Lab  has 
arranged  a  six-month  trial  of  Statistics  Canada's  new 
mapping/graphing  package,  called  E-STAT.  A  parallel 
trial  is  scheduled  at  the  Education  satellite  site. 

E-STAT  by  Statistics  Canada 
E-STAT, which had its beginnings in "Telichart",
provides user-friendly access to CANSIM ("Canadian
Socio-economic Information Management System"), and
1986 Census data on CD-ROM. The CANSIM CD-ROM
contains over 175,000 series selected from the Statistics
Canada "Mainbase" of over 500,000 time series. Census
data are drawn from the 1986 Census Profiles CD-ROM
and exclude all but the Census Subdivision (CSD)
Profiles.

E-STAT provides basic graphing facilities for either
CANSIM or Census data. A variety of graph types and
graph-enhancement options are available. Basically, this
portion of the E-STAT package builds on Statistics
Canada's CANSIM CD-ROM software, with the addition
of enhanced graphing capabilities.

As for mapping, E-STAT provides "boundary files" for:

Canada  by  province 

Canada  by  Census  Division 

Province  by  Census  Division 

Sub-Provincial  Region  by  Census  Subdivision 

Only  Census  data  is  mappable.  Hewlett-Packard  InkJet 
and  PaintJet  printers  are  supported. 

The  mapping  portion  of  E-STAT  is  designed  to  let  users 
produce  high-quality  colour  maps  with  relative  ease. 
Users  select  desired  data  (CANSIM  or  1986  Census), 
desired  geography  (as  listed  above)  and  the  program 
automatically  generates  a  high-resolution  colour  map  on 
the  computer  monitor. 

Retrieved  data  may  be  manipulated  to  create  new 
variables  (useful  in  creating  "percent"  from  "count" 
variables).  All  basic  arithmetic  operations  are  possible. 
Maps  and  graphs  can  be  saved  to  disk.  Map  display 
options  include: 

Zooming  in/out 

Panning  (shifting  the  position  of  the  map) 

Place  name  labels/Data  value  labels  (user-selected  and 
automatic) 

Changing  ranges  for  data  categories 

Changing  titles 

Adding text strings
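The derived-variable step mentioned above (creating a "percent" variable from "count" variables) amounts to simple arithmetic on retrieved values. The sketch below is not E-STAT code, which is menu-driven; the function name, field names, and figures are hypothetical and serve only to illustrate the calculation.

```python
def percent(count, total):
    """Return count as a percentage of total, or None if total is zero."""
    if total == 0:
        return None
    return 100.0 * count / total


if __name__ == "__main__":
    # Hypothetical Census Subdivision records: total population and a count.
    rows = [
        {"csd": "Area A", "population": 1000, "aged_65_plus": 150},
        {"csd": "Area B", "population": 4000, "aged_65_plus": 500},
    ]
    for row in rows:
        row["pct_65_plus"] = percent(row["aged_65_plus"], row["population"])
        print(row["csd"], row["pct_65_plus"])
```

Mapping the derived percentage rather than the raw count keeps large and small areas comparable on the same map.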

Service 

Before the arrival of the GIS-related hardware, the Data
Centre had no public computing facilities. Service on
ALL fronts has been vastly improved by the availability
of the GIS computer, which is not limited to GIS functions.

Access  to  the  GIS  computer  is  scheduled  using  a  sign-up 
sheet  blocked  in  half-hour  intervals.  Users  are  free  to 
schedule  time  on  the  computer  for  any  GIS  or  Data 
Centre-related  function.  A  range  of  numeric/text  and  full- 
text  databases  are  available  on  CD-ROM  and  can  be 
accessed using this machine.

From  a  reference/trouble-shooting  perspective,  the  second 
computer  permits  the  Data  Centre  Librarian  and  a  user  to 
be  connected  to  the  mainframe  simultaneously,  greatly 
enhancing  service. 

Future  Developments 

There  are  several  developments  underway,  and  several  in 
the  planning  stages.  Developments  are  described  below, 
under the categories:

Software/Data, 

Hardware/Networks,  and 

Service 

Software/Data 

Software 

As new software becomes available in the GIS Lab,
satellite sites will be in a position to "upgrade" their
capabilities.

For  example,  OS/2  version  2.0  is  scheduled  to  be  installed 
in  the  GIS  Lab,  and  will  be  installed  in  the  Documents 
Unit  satellite  site.  OS/2  will  enable  users  to  multi-task, 
running  different  sessions  concurrently  (e.g.  manipulating 
data  in  one  window  and  mapping  in  another,  running 
DOS  and  Windows  applications  concurrently). 

To  provide  more  flexibility  in  dealing  with  user-defined 
data  and  mapping  needs,  several  mapping  packages  are 
being  considered.  Primary  among  these  are  MAPmaker, 
provided  by  Strategic  Mapping,  and  its  "twin" 
MAPviewer  from  Golden  Software  (Strategic  Mapping 
bought  MAPviewer  from  Golden  Software  and  renamed 
it  MAPmaker).  Both  packages  run  under  Windows. 

Data 

The  Map  Library  will  soon  be  acquiring  the  "Digital  Map 
of  the  World",  the  product  of  a  long-term,  multi-nation 
project. Scaled at 1:1,000,000, this CD-ROM based
product will permit direct, user-friendly access to basemaps
from around the globe. Access will be mediated
through  the  Documents  Unit  GIS  satellite  site. 

On  a  more  local  scale,  digitised  maps  of  "high-demand" 
areas (e.g. the city of Kingston) are available in the GIS
Lab and will be made available through the satellite site.
Once  appropriate  software  is  obtained,  it  will  be  possible 
to  integrate  these  maps  and  existing  data  sources  (e.g. 
Enumeration  Area  and  Census  Tract  data  from  the  1986 
Census). 

Users  can  import  their  own  data  into  E-STAT,  to  be 
mapped  using  the  existing  array  of  map  coordinate  files. 
This could prove useful for producing custom maps for
geographic areas included with E-STAT, but more
flexibility is needed to meet broader user demand.
E-STAT would be greatly improved if user-defined map
coordinate files could be imported.

1991 Canadian Census data on CD-ROM will be accessible
using E-STAT. The availability of data for two census
years will enhance E-STAT's usefulness for students and
researchers. In addition, Statistics Canada is planning to
include  other  social  and  environmental  variables  in  future 
releases  of  E-STAT. 

Hardware/Networks 

Hardware 

Hardware  upgrades  are  already  planned  in  response  to 
anticipated  data/software  acquisitions.  To  accommodate 
OS/2,  the  GIS  microcomputer  will  need  an  additional  4-6 
Mb  of  RAM,  bringing  it  to  a  total  of  8-10  Mb.  The 
"Digital  Map  of  the  World"  requires  a  math  co-processor 
board.  Other  hardware  upgrades  may  become  necessary 
as  the  "computer  processing"  demands  of  software  and 
data  grow. 

On a larger scale, the GIS Lab is using an IBM 550 RISC
Workstation as a server, providing access to GIS software
(Arc/Info) over the Ethernet backbone. Additional
mainframe (VM/ESA) disk storage (DASD) is being
obtained as well. All of these resources will be available
to authorised users, including the GIS satellite sites.

In  addition,  certain  GIS  satellite  sites  are  expanding  their 
hardware  base  to  include  "X-stations"  (diskless  worksta- 
tions) to  provide  broader  access  to  their  facilities  at  a 
reasonable  cost. 

Networks 

The Queen's/IBM GIS Lab and all satellite sites will
exploit the advantages of network connectivity to the
fullest. Common datasets (numeric/text or spatial) will be
made available on the network, as will various software
packages (e.g. Arc/Info).

On  a  smaller  scale,  the  Data  Centre  itself  could  be 
networked  using  "mini-network"  packages  that  permit 
two  computers  to  seamlessly  "share"  data  and  software. 
This  could  prove  useful,  given  that  the  "old"  computer  is 
near capacity both in terms of hard disk space and
expandability.

Service 

The real "test" of the service component of the Documents
Unit GIS satellite will come in September when
students  return.  Until  then,  only  limited  opportunities  are 
available  for  evaluating  the  most  effective  way(s)  of 
providing  service.  In  anticipation  of  demand,  however, 
plans  have  been  discussed  and  groundwork  laid. 

By  their  nature,  GIS  and  mapping  require  printed  output. 
The  costs  of  operating  an  HP  PaintJet  are  considerable 
(colour  toner  -  $39.00  Cdn,  paper  -  $24.00/200  sheets). 
As  such,  a  "fee-for-service"  policy  has  been  adopted, 
with  graph  or  map  output  costing  $1.00  per  page.  A 
separate account fund was established along the lines of
the existing photocopier system. Determining an appropriate
price, though not difficult, did take into consideration
the following:

Keep  accounting  simple,  limit  need  for  "change" 

Don't  make  printing  so  inexpensive  that  users  will 
print  indiscriminately 

Don't make printing so expensive that users will
monopolise time on the computer to get their map/
graph "just right"

Price  print  output  to  cover  costs 
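
The "cover costs" criterion above can be checked with a little arithmetic. The sketch below uses only the figures quoted in the text (toner $39.00, paper $24.00 per 200 sheets, $1.00 per page). The number of pages a toner cartridge yields is not given in the source, so the code solves for the break-even yield rather than assuming one. All function and constant names are illustrative.

```python
import math

TONER_COST = 39.00       # colour toner, $ Cdn (figure quoted in the text)
PAPER_COST = 24.00       # $ Cdn per pack of paper (figure quoted in the text)
SHEETS_PER_PACK = 200    # sheets per pack (figure quoted in the text)
PRICE_PER_PAGE = 1.00    # fee charged per printed page (figure quoted in the text)


def paper_cost_per_page() -> float:
    """Cost of one sheet of paper."""
    return PAPER_COST / SHEETS_PER_PACK


def break_even_pages_per_cartridge() -> int:
    """Smallest whole number of pages one toner cartridge must yield
    for the per-page fee to cover both toner and paper."""
    margin_for_toner = PRICE_PER_PAGE - paper_cost_per_page()
    return math.ceil(TONER_COST / margin_for_toner)


if __name__ == "__main__":
    print(f"paper cost per page: ${paper_cost_per_page():.2f}")
    print(f"break-even cartridge yield: {break_even_pages_per_cartridge()} pages")
```

Paper accounts for $0.12 of each $1.00 page, so the fee covers consumables as long as a cartridge lasts for at least 45 pages.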

The  dual  nature  of  GIS,  combining  maps  and  data,  makes 
cooperation  between  the  Map  Library  and  the  Data 
Centre  essential.  Map  Library  staff  have  had  to  learn 
those  "GIS"  applications  available  at  the  satellite  site  and 
the  Data  Centre  Librarian  has  had  to  learn  more  about 
providing  "map"  reference  service. 

On  a  related  note,  the  prospects  for  increased  staffing  in 
the  Data  Centre  are  unclear.  Given  current  financial 
restraint,  creative  means  may  be  required  to  provide 
necessary  staffing.  As  GIS  and  other  data-related  initia- 
tives develop,  it  is  hoped  that  a  data  technician  position 
will  be  created. 

Discussion/Conclusions

The  real  benefit  of  having  a  Documents  Unit  GIS 
Satellite  stems  from  the  diverse  range  of  resources  and 
expertise  this  GIS  site  can  draw  upon.  Combined  with 
the  Documents  Unit's  established  user  base,  this  location 
provides  an  ideal  service  point  for  emerging  and  merg- 
ing technologies. 

Given  the  early  stage  of  this  initiative,  prediction  is 
difficult. To date, use of the satellite site in the Documents
Unit has been limited to several demonstrations
and practice sessions using E-STAT.

E-STAT's  strength  is  its  ease  of  use.  A  relative  neophyte 
can  produce  a  colour,  CSD-level  map  in  under  five 
minutes.  E-STAT  was  introduced  to  groups  of  "enriched" 
high  school  students  who  were  given  the  opportunity  to 
"test  drive"  the  system.  The  students  found  the  program 
relatively easy to use, after a basic introduction. For users
familiar with other "geographic" software applications
(e.g. Atlas*Graphics), E-STAT should present no major
problems. 

E-STAT's  weakness  lies  in  its  inflexibility.  While  users 
can  import  data  files  using  the  "data  manipulation" 
facilities,  user-defined  map  coordinate  files  cannot  be 
used.  File  specifications  for  map  files  are  not  provided. 
This  "closed  box"  design  does  not  permit  E-STAT  to  be 
used  to  its  full  potential. 

Statistics Canada is conducting a survey to solicit suggestions
for improving E-STAT. A "new and improved" E-STAT
will be released, possibly by the Fall of 1992,
based on these suggestions.

Once the academic term starts in September, demand is
expected to increase. The availability of a sign-up sheet,
and the user-friendly nature of the present software,
should help Data Centre/Map Library staff manage this
demand. Key to success will be the continued cooperation
among the Data Centre, Map Library, and GIS Lab,
working together to meet users' needs. With this cooperation,
the Documents Unit's GIS Satellite will be in a
position to provide users with access to new "GIS"
hardware and/or software as demand warrants and
resources permit.


1. Paper presented at the IASSIST Conference 1992,
Madison, Wisconsin. Jeffrey Moon, Documents Reference/Data
Centre Librarian, Social Science Data Centre,
Queen's University, Kingston, Ontario, Canada K7L 5C4.
MOONJ@qucdn.queensu.ca

2.  The  Documents  Unit  is  administered  by  the  Queen's 
Library  System  and  is  comprised  of  the  Government 
Documents  Library,  the  Map  and  Air  Photos  Library  and 
the  Social  Science  Data  Centre. 

3. Geographic Information Systems: A Guide to the
Technology. John C. Antenucci, Kay Brown, Peter L.
Croswell, Michael J. Kevany. Van Nostrand Reinhold,
New York, 1991.




Election Results for IASSIST Offices


Carolyn Geda has passed along the results of the recent election for IASSIST offices. We wish to
thank all of you who submitted nominations and all of the candidates for letting their names stand.
A total of 80 ballots were received.

Those elected to offices are:

President (2 Years) ♦ Elizabeth Stephenson

Vice-President (2 Years) ♦ Peter Burnhill

Regional Secretaries

Australia ♦ Vance Merrill

Canada ♦ Wendy Watkins

Europe ♦ Bridget Winstanley

United States ♦ Laura Guy

Administrative Committee (Members at large, 4 Years)

Canada ♦ Gaetan Drolet

Europe ♦ Karsten Rasmussen

United States ♦ Margaret Adams, Ilona Einowski, James Jacobs


Those completing two more years of a term on the Administrative Committee are: Carmen Campbell
(USA), Vigdis Kvalheim (Europe), Hilde Colenbrander (Canada), Jean Stratford (USA), and JoAnn
Dionne (USA).

A special word of thanks to those completing their four year term on the Administrative Committee
goes to:

Sue  Dodd    (USA) , 

Diane  Geraci  (USA) , 

Craig  McKie  (Canada)  and 

Ann  Janda    (USA) . 




BatchCensus  /  GetCensus:  1990  U.S.  Census,  Electronic  Data 
Distribution  Software. 


by Fred Nick¹

The Center for Social Science Computation and Research

University of Washington

Creating  a  New  Census  Distribution  Network 

Our  Washington  State  Census  Data  Distribution  Network  had  a  problem.  With  no  funding  and  diminishing  staffs,  how  do 
we  distribute  large  census  data  sets  to  a  hierarchical  network  of  organizations  in  a  timely  fashion?  Each  organization 
required  the  data  in  a  different  tape  format.  No  member  of  the  network  had  the  resources  to  make  conversions  and  copies 
for  other  members.  What  do  you  do?  Well,  we  created  software  that  could  automate  the  conversion  process.  Each  mem- 
ber logs  into  a  central  machine  and  runs  software  to  create  tapes  appropriate  to  their  own  data  needs  and  structures.  These 
tapes  are  then  mailed  to  them  automatically.  Each  staff  supplies  the  work  needed  to  create  a  unique  copy.  You  only  invest 
time  in  retrieving  data  in  which  you  have  a  direct  interest.  The  timing  of  the  distribution  is  independent  of  the  schedules 
of  the  other  member  groups.  The  system  works  quickly,  efficiently  and  with  few  operational  costs. 

In early 1991, our state (Washington, USA) began reconstructing a State Census Data Center. In 1970 and 1980, networks
formed to deal with the distribution of US decennial census materials. This time, the Data Center, or Network, is a
tree structure with our state Office of Financial Management (OFM) at the tip of the tree. We called it the Lead
Agency (LA). Underneath OFM are seven Coordinating Agencies (CAs), forming a seven-"node" second tier. Beneath
each Coordinating Agency are the clients of that organization, the third tier of the tree. Some clients are very active
census users, called Affiliates. Others are one-time users of census data. It was therefore assumed that data would be sent
to the tip of the tree and distributed down its branches.

Most  of  the  Coordinating  Agencies  are  organizations  with  a  responsibility  to  distribute  data  to  a  specific  group  of  users. 
Two  Agencies  are  major  research  universities  with  large  faculty  and  student  client  bases  (the  University  of  Washington 
and  Washington  State).  Next,  there  were  two  four-year,  undergraduate  Universities  (Central  Washington  University  and 
Western  Washington  University).  The  last  Agencies  were  four  Regional  Councils  or  Government  Planning  Agencies 
(Puget  Sound  Regional  Council,  Intergovernmental  Resource  Center,  Spokane  Regional  Council,  and  the  Office  of  Long 
Range Planning for the City of Seattle).

All of these organizations, including OFM, were very short on staffing. The mix of equipment used by the CAs broke
down into three categories:

♦ People using IBM mainframes, with nine-track EBCDIC tape drives.

♦ People using VAX minicomputers with nine-track VAX normal-11 tape drives.

♦ People using MSDOS family PCs with nine-track ASCII tape drives.

Past  Methods 

In  the  past,  two  methods  were  used  to  distribute  census  data  by  computer  tape.  In  the  first  method,  the  Distributing 
Agency  Model  (figure  #1),  the  LA  receives  the  data  from  the  Census  Bureau.  The  LA  creates  numerous  copies  and  mails 
them  to  each  CA.  This  method  is  expensive  and  staff  intensive  for  the  LA.  The  distributor  often  sends  out  all  tapes  in  the 
same format and the CAs must deal with converting the data to a format they can use. CAs get copies of the data
automatically, which means they often receive data they do not need. Tape creation and distribution costs plague the
distributor and expenses for conversion burden the CAs. The time it takes to distribute each round of data can extend from days
to weeks depending on the performance of the tape producing agency.

The next method tried in our state was the Daisy Chain Method (figure #2). The Lead Agency gets the tapes, makes a
copy for itself and then mails the tapes to the first CA on the list. Each CA copies the tapes and forwards them to
the next CA. The LA makes only one copy, reducing staffing and media costs. The first copy is sent out quickly. The LA
has fewer responsibilities and the CAs can just forward unwanted data without copying it. This method also has
problems. One CA can cause a bottleneck in the whole distribution process. A CA's delay in converting or copying a tape is
EVERYONE'S problem. The CAs have to supply the employees to make the copies from the originals, creating additional
problems. Inexperienced employees often damage or destroy the distribution tapes. A new copy then has to be made by the
LA and put back into circulation. CAs still have to convert the tapes into the format used at their facilities. "When" you
will receive the distribution tape is always an unknown. Delays of two months are not uncommon. In our state's use of this
method in 1980, political bickering broke out between CAs over who was to be at the front of the distribution chain.
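
The bottleneck effect of the Daisy Chain Method can be illustrated with a small calculation. The sketch below is a toy model, not a record of the actual 1980 distribution: the agency names, handling times and mailing time are all hypothetical. Under the Daisy Chain each CA receives the tape only after every CA ahead of it has finished with it; under the Distributing Agency Model each CA waits only on the LA, which makes the copies one after another.

```python
from typing import Dict, List, Tuple

MAIL_DAYS = 3  # hypothetical mailing time per hop, in days


def daisy_chain_arrivals(handling_days: List[Tuple[str, int]]) -> Dict[str, int]:
    """Day on which each CA receives the tape when it is forwarded CA to CA."""
    arrivals = {}
    day = MAIL_DAYS  # the LA mails the tape to the first CA
    for name, days_held in handling_days:
        arrivals[name] = day
        day += days_held + MAIL_DAYS  # copy/convert, then mail to the next CA
    return arrivals


def distributing_agency_arrivals(names: List[str], la_days_per_copy: int) -> Dict[str, int]:
    """Day each CA receives its own copy when the LA makes copies sequentially."""
    return {
        name: (i + 1) * la_days_per_copy + MAIL_DAYS
        for i, name in enumerate(names)
    }


if __name__ == "__main__":
    # CA-2 holds the tape for a month; everyone behind it waits.
    chain = [("CA-1", 2), ("CA-2", 30), ("CA-3", 2), ("CA-4", 2)]
    print(daisy_chain_arrivals(chain))
    print(distributing_agency_arrivals([n for n, _ in chain], la_days_per_copy=1))
```

With these made-up numbers, a single slow CA pushes the arrival date for every later CA past day 40, while the Distributing Agency Model delivers within a week at the price of the LA doing all the copying.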

New  Method 

So we created a new distribution scheme based on the Distributing Agency Model. It includes new facets meant to avoid
the problems of the two earlier networks. I call this method the Vending Machine Model (figure #3). The Census Bureau
sends the data to the LA as before. The LA creates one copy and sends it to my organization, one of the CAs. We
developed software to automate the tape creation step for the rest of the CAs. To avoid having one site supply the staffing to
create the tapes, we obtained free accounts on one of our campus machines for the other CAs. We trained members of
their staff to use the software. We informed each CA by e-mail on these free accounts when new data was ready. Each
CA would have someone log on and run the software to create a tape of the data they needed in a format appropriate for
their computer systems. The tapes are then automatically shipped to them through the mail.

There are still negatives in this system. Each CA's staff had to be trained to use the software. There were bugs in the
initial versions of each software product to discover and correct.

GetCensus,  Project  Phase  One 

The first version of the software is called GetCensus. It is interactive and menu-driven. It was created by a student
programmer at a salary of $6/hour (£4/hour). We designed it to create nine-track round tapes in the formats needed by the
CAs. These included ASCII, EBCDIC and VAX/VMS tape formats. Two tape densities are supported: 1600 and 6250 CPI.
We built the software to take advantage of existing software on our systems (tape management routines, tape conversion
utilities, etc.). We then turned the CAs loose on the software. Several flaws emerged. First, the user had to wait while the
tape creation was in progress. Second, the users had to do the arithmetic necessary to figure out which datasets would fit
on each tape. Last but not least, an operator at the computer facility we were using (the-operator-from-hell) had an
uncanny knack for screwing up tape mounts associated with this project. She managed to cause more problems than all
other sources combined! Since some of our users liked the interactive version as it was, we built the second
version as a somewhat different piece of software, which we called BatchCensus.
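
The character-set side of these format conversions can be sketched in a few lines. The example below is not GetCensus itself, which relied on the tape utilities already present on the campus system. It only illustrates re-encoding a fixed-length ASCII record as EBCDIC using Python's built-in cp037 codec (one common EBCDIC code page, chosen here as an assumption). The record content and record length are hypothetical.

```python
RECORD_LENGTH = 16  # hypothetical fixed record length, in bytes


def ascii_to_ebcdic(record: bytes) -> bytes:
    """Re-encode one ASCII record as EBCDIC (code page 037)."""
    return record.decode("ascii").encode("cp037")


def ebcdic_to_ascii(record: bytes) -> bytes:
    """Reverse conversion: EBCDIC (code page 037) back to ASCII."""
    return record.decode("cp037").encode("ascii")


def pad_record(text: str, length: int = RECORD_LENGTH) -> bytes:
    """Pad a text line with blanks so that every record has the same length."""
    if len(text) > length:
        raise ValueError("record longer than the fixed record length")
    return text.ljust(length).encode("ascii")


if __name__ == "__main__":
    ascii_record = pad_record("WA 53033 KING")   # made-up sample record
    ebcdic_record = ascii_to_ebcdic(ascii_record)
    assert len(ebcdic_record) == RECORD_LENGTH    # still one byte per character
    assert ebcdic_to_ascii(ebcdic_record) == ascii_record
    print(ebcdic_record.hex())
```

Because both encodings use one byte per character, record and block lengths are unchanged by the conversion; only the byte values differ.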

BatchCensus,  Project  Phase  Two 

BatchCensus  had  the  same  features  as  GetCensus  with  the  flaws  corrected.  (All  except  the-operator-from-hell.  She  was 
still  lurking  in  the  background).  BatchCensus  builds  the  job  and  then  submits  it  to  the  background.  The  client  then  logs 
out and checks later to see if the run was completed. The job notifies the user by e-mail when it is completed. (It also
sends us a complete log of the run so we can help if the-operator-from-hell reappears.) The new software calculates the
amount of tape space needed by your run and organizes the files to fit on the fewest tapes. We also added a
concept  called  a  "tapeset."  When  the  Tiger  mapping  files  were  released,  we  discovered  that  you  had  to  request  over  300 
files  by  name  when  using  GetCensus.  In  BatchCensus  you  can  request  them  an  entire  tape  at  a  time.  These  tapesets  often 
mirrored  the  organization  on  the  original  census  release  tapes.  When  we  set  up  GetCensus  we  received  a  gigabyte  of  disk 
space  on  the  computer.  All  new  data  goes  onto  disk  and  tape.  As  new  data  arrives  older  data  on  disk  is  removed,  leaving 
a  copy  on  tape  available.  BatchCensus  can  retrieve  from  disk,  tape  and  tapesets  in  the  same  session.  BatchCensus  is  very 
easy  to  use. 
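
The tape-fitting step lends itself to a short sketch. The article does not say which algorithm BatchCensus used, so the example below swaps in a standard heuristic, first-fit decreasing, which packs files onto few tapes but does not guarantee the true minimum. The file names, sizes and tape capacity are hypothetical.

```python
from typing import Dict, List


def pack_files(file_sizes_mb: Dict[str, float], tape_capacity_mb: float) -> List[List[str]]:
    """Assign files to tapes with the first-fit-decreasing heuristic."""
    tapes: List[List[str]] = []   # file names placed on each tape
    free: List[float] = []        # remaining capacity of each tape

    # Largest files first; ties broken by name for a stable result.
    for name, size in sorted(file_sizes_mb.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > tape_capacity_mb:
            raise ValueError(f"{name} does not fit on a single tape")
        for i, room in enumerate(free):
            if size <= room:          # first tape with enough room
                tapes[i].append(name)
                free[i] -= size
                break
        else:                         # no existing tape had room: start a new one
            tapes.append([name])
            free.append(tape_capacity_mb - size)
    return tapes


if __name__ == "__main__":
    files = {"stf1a": 90.0, "stf3a": 70.0, "tiger01": 40.0, "tiger02": 30.0, "pl94": 20.0}
    for number, contents in enumerate(pack_files(files, tape_capacity_mb=140.0), start=1):
        print(f"tape {number}: {', '.join(contents)}")
```

In this made-up case 250 MB of files fit on two 140 MB tapes, which is also the theoretical minimum; the user no longer has to do that arithmetic by hand.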

The  Final  Analysis 

In review of the project so far, we have satisfied all objectives for the project. Programming costs were less than $500
(£335). The gigabyte of disk space and computer accounts are donated to the project by our campus central computer
facilities. My organization contributes the staff time, support costs and project overhead, which had to be allocated to
serve our own clients in any case. The additional responsibilities taken on by our staff were to be recompensed by our
receiving additional CD/ROM versions of the data. The Office of Financial Management reneged upon this commitment,
which was the most disappointing aspect of the project.

Unforeseen problems occurred during the project. First was the-operator-from-hell, mentioned previously. Several of
the smaller Members turned to the CD/ROM versions of the data because they worked better on their desk-top machines
than the nine-track tape drives did. We had one individual who was unwilling to change to use the new system. He
preferred to do things exactly the same way he had for the past ten years, even though the new system was much faster
and less troublesome than his traditional methods. Another Member kept rotating new people into the staff position that
was responsible for using BatchCensus. This forced my organization to retrain new users continually.


Figure #1. DISTRIBUTING AGENCY MODEL used in distributing 1970 Data

Figure #2. DAISY CHAIN MODEL used in distributing 1980 Data

Figure #3. VENDING MACHINE MODEL used in distributing 1990 Data

Legend for Figures #1 to #3:
Census = the U.S. Bureau of the Census
LA = Lead Agency (Distributing Agency)
CA = Coordinating Agencies
O = Affiliates


Of the original members only half continue to use this system. One uses his traditional manual system. Two have
converted to CD/ROM use exclusively. We have, however, quadrupled the number of people using the system by allowing
20-30 campus users to use the software to move census data to their own machines. So we lost three members, kept four
members and added twenty to thirty new users!

The  Last  Word 

If  we  were  asked  to  do  the  project  again,  would  we  accept  it  and  what  would  we  change?  Yes,  we  would  accept  the 
project  again.  We  had  a  good  time  developing  the  project  and  it  has  saved  us  many  manual  tape  copies.  It  also  greatly 
expedited the delivery of the data to our users. The vast majority of our faculty and staff accessed the data while it was
still on disk storage and avoided tape use altogether. The things we would change are:

♦  We  would  use  the  Internet,  if  possible,  for  some  deliveries. 

♦  We  would  make  Members  commit  a  single  individual  to  the  project  for  its  life  span. 

♦  Rearrange  the  Network  to  include  only  large  sites  which  can  deal  with  the  computer  tapes.  The  smaller, 
desk-top  sites  should  use  CD/ROM. 

♦  A  written  contract  should  include  all  responsibilities  of  the  Lead  Agency  and  the  Members  of  the 
Network,  and  include  penalties  for  not  complying  with  its  points. 

We are quite satisfied with the results of the project and continue to reap the benefits of the software.

1. Presented at IASSIST/IFDO 93 Conference held in Edinburgh, Scotland, May 1993. Fred Nick, Director, The Center
for Social Science Computation and Research, 145 Savery Hall DK-45, University of Washington, Seattle, WA 98195
USA. Internet: fred@u.washington.edu. (206) 543-8110




The  Need  for  Revised  Data  Documentation  Standards:  New 
Solutions  for  Old  Problems. 


by Laura A. Guy¹

University  of  Wisconsin-Madison 

Abstract 

The problems currently facing archivists concerning data
documentation quality are not new ones. Manuals such as
Geda's Data Preparation Manual and Roistacher's Manual
for Machine-Readable Data Files and Their Documentation
have been on our shelves for many years. In the
1990s new difficulties have arisen due to ever-shrinking
budgets that prohibit the production of archive-generated
documentation,  an  increased  awareness  of  the  need  to 
follow  proper  bibliographic  standards  and  encourage 
proper  bibliographic  citation  of  data  by  users,  increasing 
pressures  by  granting  agencies  upon  producers  to  archive 
their  data  in  order  to  make  it  publicly  available  quickly, 
and  the  spreading  use  of  CAPI/CATI  programs  for  data 
collection.  Current  standards  as  specified  by  Geda, 
Collins/Powers,  et  al,  are  examined,  possible  revisions  are 
proposed,  and  suggestions  are  made  on  how  to  get  a  set  of 
new  standards  in  place  and  utilized. 

I  am  going  to  speak  about  something  I  have  come  to  call 
The  Public  Use  Process.  This  process  involves  making 
data  in-house  ready  for  use  in  secondary  analysis.  As  I 
define it, The Public Use Process ensures that data are, in
their ideal state, clean, well-documented, and user-friendly.
The opinions expressed here are totally my own,
the point of view taken being that of a small-scale 'local'
archivist. To prepare for this paper I spoke to a number
of data creators and others involved in producing data for
public use, based both locally and at the federal level. I
also spoke to a number of data archivists².

The  major  difficulty  I  had  in  writing  this  paper  is  very 
personal  in  nature:  my  first  job  at  the  Data  and  Program 
Library  Service  was  to  assist  in  the  documentation  of 
archival  studies.  I  spent  many  long  hours  writing  "per- 
fect" abstracts,  introductions,  and  descriptions  of  method- 
ology, as  well  as  badgering  reluctant  P.I.s  to  part  with 
information  necessary  to  complete  the  archival  processing 
of  their  data.    I  disliked  the  work,  and  I  compare  it  to  the 
onerous task of documenting programs: nobody likes to
do it. I therefore come to this subject with a built-in bias.


PART I

I. Problems

II.  Issues 

III.  Where  We  Are,  Where  We've  Been 

IV.  Recommendations 

I.  Problems 

The  lot  of  the  data  archivist  is  not  an  easy  one.  Those  of 
us  responsible  for  the  day  to  day  operation  of  a  data 
library  face  a  set  of  special  problems  when  dealing  with 
our  archival  responsibilities.    Some  of  these  problems 
are  technology-based,  concerning  the  storage  and 
retrieval  of  data,  but  others  have  to  do  with  people. 

Some  of  these  problems  include: 

Problem  1.  Identification/Location 

We need to identify potential data sources: What data
are out there? Who is collecting them?

Problem  2.  Cooperation/Negotiation 

Does the researcher agree to archive the data? Will there
be restrictions placed on them?

Problem  3.  Validation 

Are  the  data 'healthy'?  What  media  are  they  on?  Is  the 
media  reasonable?  How  are  the  data  organized?  In  what 
format  are  they  written?  Are  they  raw  ASCII  files? 
System  files?  Lotus?  DBase? 

Problem  4.  Documentation 
Are  the  data  documented?  To  what  extent  are  they 
documented?  What  format  is  the  documentation? 
Hardcopy?  Machine-readable?  Wordprocessor  specific? 

This  last  problem  can  be  the  most  frustrating  one  for  the 
archivist;  IF  you  make  it  through  hurdles  1,  2  and  3, 
problem  4  can  stop  you  in  your  tracks.  All  of  us  have 
our  own  nightmarish  stories  about  data  lost  forever 
because  the  documentation  was  inadequate  or  non- 
existent. 

II.  Issues 

Although  all  of  the  above  problems  are  of  some  interest, 
it  is  the  fourth  that  concerns  us  here.  Allow  me  to  make 
a  couple  of  generalizations  about  documentation: 




Generalization 1. Data have no value without adequate
documentation.

There are many possible reasons for the generally poor
condition of data documentation:

Reason  1.  Lack  of  appropriate  funding  to  produce  a 
quality  product 

Funding  agencies  routinely  cross  out  budget  entries  that 
allow  for  the  creation  of  public  use  files.  Examples 
include  requests  for  funds  to  hire  archival  staff,  and 
money  for  producing  documentation,  microfiching  the 
questionnaires,  and  photocopying. 

Reason 2. Lack of quality control procedures
implemented throughout the life of the project.

Many projects do not concern themselves with The Public
Use Process until the end of the grant period. By then, many of
the decisions that have a direct effect on the
development, content, and quality of the data have been forgotten,
proper documentation has not been kept, and important
staff have left and information has been lost.

Reason  3.  Lack  of  available  standards  and  guidelines. 

Investigators  are  often  ignorant  of  the  availability  of 
sources  that  might  assist  them.  These  not  only  include 
manuals  such  as  Geda,  etc.,  but  human  resources  like 
local  data  archivists/librarians/ICPSR  members/resource 
people  at  granting  agencies,  and  so  forth. 

Reason  4.  Lack  of  available  expertise. 

Researchers who have not been involved in data collection
projects often don't have the past experience to solve
some of the problems described above. For example, they
are not familiar with the technique of padding their
budget to allow for production of documentation.

Reason  5.  Non-recognition  of  the  value  of  secondary 
data  sources  for  scientific  research  and  public  policy 
planning. 

Researchers  are  often  short-sighted  or  overly  custodial. 
They  don't  realize  that  others  might  be  interested  in  their 
data,  or  they  don't  want  to  make  them  accessible  before 
they have completely 'mined' them, and often by then
information essential to understanding the data has been
lost, as has any motivation to complete The Public Use
Process.

Generalization  2.  The  ever-increasing  value  of 
secondary  data  leads  to  their  improved  utilization.  Thus, 
good  documentation  becomes  more  and  more  valuable. 
Good documentation benefits many by serving numerous
purposes:

Reason 1. Future use

Documentation is the key to understanding the
quality of the data. The text accompanying a data file enables the
conceptual framework of the file's creators to be understood.
It describes the development and content of the
data.

Reason  2.  Quality  control 

Good  documentation  contributes  not  only  to  the  future 
use  of  the  data  but  may  enable  future  audits  of  data 
quality  and  the  operations  that  produced  it.  It  can  also 
facilitate replication of the study at a later date.

Reason 3. Bibliographic control

The  quality  of  the  documentation  also  affects  the  archi- 
vist's work  because  it  supplies  information  for  proper 
bibliographic  control.  Bibliographic  control  is  one  way 
that we can encourage good citation of data. This, in
turn, ensures proper indexing by bibliographic services
and  facilitates  the  scholarly  process  by  providing  other 
researchers  with  sufficient  information  to  acquire  the 
data. 

III. Where We Are, Where We've Been

As we move towards making recommendations for the
future, it may be instructive for us to look briefly at
issues we've dealt with, as a profession, in the past.

1981:  What  were  the  perceived  problems? 
Twelve years ago Robbin, speaking in a plenary session
of an IASSIST/IFDO joint conference³, identified the
major problem facing our profession as the under-utilization
of data, and outlined three areas that contributed
to the situation:

Problem  1.  Lack  of  quality  data. 

Data  were  commonly  found  to  be  of  "low  quality"  or 
erroneous.  Because  production  was  removed  from  use, 
producers  (such  as  contractors)  lacked  an  understanding 
of  how  the  data  would  be  used  and  what  information  was 
needed to use them properly.

Problem  2.  Lack  of  coordination  of  data  resources, 
particularly  poor  information  management  by 
government. 

Poor  records  management,  administrative  "tunnel 
vision",  and  technological  complications  resulted  in  the 
loss  of  data  and  its  associated  documentation. 

Problem 3. Lack of understanding of the utility of data
for research and policy.

Administrative  activity  tended  to  reduce  the  incentive  for 
"backward"  looks,  resulting  in  a  throw-away  mentality. 
Thus,  data  such  as  the  1940  and  1950  U.S.  Censuses  of 
Population  and  Housing  had  to  be  recreated  from  micro- 
fiche at  a  cost  of  millions  of  dollars. 

1993:  How  have  things  changed? 
Organizations such as IASSIST, IFDO, APDU, and
scores of dedicated professionals have been instrumental
in moving us forward over the last decade. The use of
data has become an integral part of the education process,
and numeric data are being incorporated into
mainstream libraries as part of depository and/or CD-ROM
collections:

Opinion  1.  I  don't  believe  that  we  need  to  convince 
people  of  the  value  of  data. 

Note  the  existence  of  the  National  Data  Archive  on  Child 
Abuse,  the  Data  Archive  on  Adolescent  Pregnancy  and 
Pregnancy  Prevention,  and  the  ICPSR's  continued 
growth.  Also  look  at  recent  U.S.  legislative  attempts  like 
the  "GPO  Gateway  to  Government  Act  of  1992",  which 
was to provide "public access to a wide range of
government electronic databases"⁴. Data have become more
common,  instead  of  less.  As  an  acknowledgement  of  the 
value  of  secondary  analysis  there  is  increasing  effort  on 
the  part  of  granting  agencies  to  require  that  data  be  made 
public. 

Opinion 2. I don't believe that data resources are
under-utilized. 

Look  at  the  increasing  number  of  conferences  such  as  the 
NCHS/CDC  Public  Health  Conference  on  Records  and 
Statistics,  workshops  in  Essex  and  Ann  Arbor,  the 
Census  Data  Users  Conferences,  the  National  Data 
Archive  on  Child  Abuse  and  Neglect  Summer  Institute, 
and  the  multitude  of  university  courses  that  teach  secon- 
dary analysis  methodology  and  techniques,  etc.  Things 
have  improved  in  the  last  decade.  However: 

Opinion 3. I do believe that people are not paying
attention to the standards that have been set in the past.

We have to ask ourselves "WHY?",

AND:

Opinion 4. I do believe that there is still widespread
mismanagement of information (note the near-loss of the
World Fertility Surveys), but that this situation is
continuing to improve as vigilance becomes more
widespread.

As  most  of  us  know,  many  social  science  funding  agen- 
cies now  mandate  the  depositing  of  data  into  an  existing 
archive.  The  United  States  Government  continues  to 
make data available through its own (adequate or not)
departments. Dr. Eric Lang of Sociometrics Corporation
states  that  government  data  from  sources  such  as  the 
Center  for  Disease  Control  and  the  National  Center  for 
Health  Statistics  are  "fairly  systematic"  and  produce  few, 
if  any,  archival  problems.  More  frequently  it  is  the 
"individual  researcher"  who  is  the  source  of  difficulties. 
Such  researchers,  whose  projects  are  often  funded  by 
smaller  university  grants  or  other  non-federal  sources, 
may  be  great  scholars  in  their  fields  but  are  not  great  data 
documenters. 

IV.  Recommendations 

Issue 1. As archivists, how can we promote The Public
Use Process?

Suggestion  1:  We  can  encourage  that  language  be 
included  in  the  RFP  (Request  for  Proposals),  or  in  the 
grant  notification  document,  that  establishes  how  soon 
public  use  data  should  be  made  available,  specifies 
what materials are required for archiving, and outlines
the  need  for  proper  planning  for  both  data  quality  and 
adequate  documentation.  We  can  work  with 
institutions,  such  as  major  research  universities  where 
the  grants  are  written  and  where  major  data  collection 
efforts  take  place,  to  ensure  that  data  collected  by 
affiliated  researchers  are  identified  and  deposited  in 
archives.  In  this  way  granting  agencies  can  pressure 
researchers  to  make  sure  that  data  are  ready  to 
archive,  and  hopefully  provide  archives  with  more 
clout  than  just  pleading  with  producers:  "Oh  please 
won't  you  deposit  your  data?",  "Oh  please  won't  you 
document  it  properly?" 

Suggestion  2:  We  need  to  do  everything  possible  to 
facilitate  The  Public  Use  Process.  For  example  we 
can  become  more  flexible  in  our  acceptance 
requirements  by  allowing  data  to  remain  in  the  form 
in  which  it  was  archived  and  making  access  rather 
than  standardization  (i.e.,  the  "plain  vanilla"  data 
process)  a  priority.  This  may  be  a  radical  departure 
from  the  way  many  of  us  think,  and  how  we  define  the 
term  "providing  access".  We  need  to  be  careful  of  the 
demands  we  make  on  producers  and  take  care  to  keep 
guidelines  simple,  straightforward,  and  realistic. 

Issue 2. Poor documentation is a, if not the, primary
hindrance to the archival acceptance of data.
Encouraging the development of and adherence to
proper standards is something that organizations such as
IASSIST can and should take part in:

Suggestion  1:  We  can  encourage  granting  agencies  to 
not  strike  out  budget  items  for  data  documentation, 
archival  staff,  etc.,  and  make  suggestions  as  to  what 
sorts  of  items  must  be  included  to  insure  a  proper 
level  of  documentation. 

Suggestion 2: We can encourage producers to think
about The Public Use Process throughout the life of
the project.

Suggestion  3:   We  can  encourage  that  the  producers 
be  part  of  the  documentation  process  and  that 
documentation  be  treated  as  a  Real  Document, 
prepared  at  the  time  of  the  data's  creation,  rather  than 
something  that  programmers  produce  as  an 
afterthought.  This  might  be  a  way  to  avoid  problems 
such  as  those  presented  by  the  National  Supported 
Work  Demonstration  Project,  where  Mathematica 
received $50,000 to clean up, document, and produce a
public  version  of  the  data  long  after  their  creation. 

Suggestion  4:  We  can  encourage  the  adoption  of  a 
professional  attitude  concerning  documentation  and 
the  development  of  organizational  commitment,  as 
well  as  institutional  support,  for  the  effort — it  is  not 
creative  work,  fascinating  work,  exciting  work;  no  one 
really  wants  to  do  it  but  it  is  necessary! 

Suggestion  5:  By  clarifying,  establishing  and 
publicizing  guidelines  that  are  reasonable  and 
adequate,  we  can  encourage  The  Public  Use  Process 
and  maybe  make  everyone's  lives  a  bit  easier. 

PART II

I. Current Standards

II.  Changes  Needed 

III.  The  Need  for  Balance 

IV.  How  to  make  it  work. 

I.  Current  Standards 

The  following  are  four  current  "standards"  (not  an 
inclusive  list,  but  commonly  acknowledged): 

Carolyn  L.  Geda,  Data  Preparation  Manual  (1980). 

Richard  Roistacher,  A  Style  Manual  for  Machine-Read- 
able Data  Files  and  Their  Documentation  (1980). 

U.S.  Department  of  Justice.  Bureau  of  Justice  Statistics, 
Technical  Standards  for  Machine-Readable  Data  (nd). 

Patrick  Collins  and  Jane  L.  Powers,  The  Preparation  of 
Data  Sets  for  Analysis  and  Dissemination:  Technical 
Standards  for  Machine-Readable  Data  (1991). 

Geda 

Reflecting  the  time  in  which  it  was  written,  this  manual 
talks  about  tapes.  The  author  gives  some  indication  of 
'levels'  of  description  and  denotes,  through  the  use  of 
asterisks,  high-priority  items  that  she  feels  are  absolutely 
required  to  understand  and  use  the  data.  The  manual  is 
fairly  straightforward  and  simple,  and  promotes  some 
degree of flexibility. She provides a checklist that is three
pages long, which is intended to act as an inventory that
assists  in  systematizing  work  steps.  She  provides  eight 
documentation  sections: 

A.  General  Identification  or  Information  Control  Ele- 
ments (title,  authorship,  agency,  grant  number,  edition) 

This  section  includes  elements  that  are  required  to 
properly  identify  the  survey  as  well  as  the  agencies  and/or 
individuals responsible for its collection. Appropriate
unique identifying information helps to prevent confusion
and assists in proper citation.

B.  Disclaimer 

This  section  insures  that  collecting  agencies  and/or 
individuals  are  not  held  responsible  for  secondary  use  of 
the data and for interpretations and findings reported by
secondary  users. 

C.  Bibliographic  Citation 

Provides  examples  of  how  the  data  and  documentation 
should  be  cited. 

D.  Feedback  from  Users 

This  section  requests  that  users  submit  copies  of  their 
work  based  on  the  use  of  the  data  and  asks  that  copies  be 
sent  to  the  distributing  organization. 

E.  Methodology  and  Findings  (survey  objectives, 
abstract, survey design, etc.)⁵

This  section  provides  substantive  information  required  to 
evaluate and use the study.

F. Technical Information (missing data, error reports,
data  and  file  structure) 

This  section  documents  the  type  and  extent  to  which  the 
data were cleaned and checked, and informs users about
the  structure  of  the  file. 

G.  List  of  Variables 

A  list  in  summary  form  is  cited  as  a  useful  aid. 

H.  Codebook  (frequencies,  instrument) 

Roistacher  et  al 

Again tape-oriented. It includes lots of examples. The
manual is very verbose, and complicated in appearance.
It describes itself as a 'minimal' standard. He
provides for five sections in his user's guide, and his
checklist is twelve pages long!

A.  "Preliminaries"  (bibliographic  attributes,  title,  author, 
producer,  distributor,  edition  statement,  title  page,  etc.) 

Establishes the bibliographic identity of the file.

B.  History  of  Originating  Project  (coverage,  collection 
methods,  publications) 

Provides  a  short  history  and  "evolution"  of  the  data  file. 

C.  File  Processing  Summary  (archiving  history,  archival 
file  description) 

Editing  strategies  and  general  processing  history. 

D.  Data  Dictionary  Description  and  Listing  (codebook) 

E.  Appendices  (extended  coding  lists,  instrument) 

This  includes  bulky  information  that  would  otherwise 
break  up  the  continuity  of  the  main  section  of  the  user's 
guide. 

Bureau  of  Justice  Statistics 

This is the shortest of the four documents; more space is
given  to  information  concerning  the  proper  coding  of 
data  and  organization  of  files  than  to  documentation 
standards.  It  simplifies  the  documentation  process  into 
four  parts: 

A.  Abstract  (bibliographic  citation,  general  description 
of  universe,  subject  matter,  technical  descriptions,  related 
publications) 

B.  Codebook 

C.  Tape  table  of  contents 

D.  Frequency  tables 

"The  minimal  documentation  of  a  data  file  consists  of  a 
tape  volume  table  of  contents,  a  character  and  octal  or 
hex  dump  of  a  sample  of  records  of  each  file,  informa- 
tion sufficient  to  construct  a  bibliographic  citation  to  the 
machine-readable  data  file,  an  abstract  of  the  file's  form 
and content and a minimal codebook". (pg 11)
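
The "character and octal or hex dump of a sample of records" that the BJS standard asks for can be illustrated with a short sketch. This is only an illustration in Python, not something taken from the BJS document: the function name dump_records, the fixed record length, and the sample bytes are all assumptions made for the example.

```python
def dump_records(data: bytes, record_length: int, sample: int = 3) -> list[str]:
    """Return a character dump and a hex dump for the first `sample` records.

    Assumes fixed-length records (an assumption made for this sketch).
    """
    lines = []
    for i in range(sample):
        record = data[i * record_length:(i + 1) * record_length]
        if not record:
            break  # fewer records in the file than requested
        # Show printable ASCII as-is; replace everything else with a dot.
        chars = "".join(chr(b) if 32 <= b < 127 else "." for b in record)
        lines.append(f"record {i + 1} char: {chars}")
        lines.append(f"record {i + 1} hex:  {record.hex(' ')}")
    return lines


if __name__ == "__main__":
    # Two hypothetical 6-byte records: a 4-digit ID, a sex code, a newline.
    for line in dump_records(b"0011M\n0022F\n", record_length=6, sample=2):
        print(line)
```

A depositor could run something along these lines on the first few records of each file and send the printout with the documentation, so that the archive can check the record layout against the codebook.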

Collins/Powers 

This  draws  heavily  from  the  BJS.  Significantly,  it  makes 
references to different "types of files" such as SAS,
SPSS  system  files,  and  so  forth.  It  is  primarily  tape- 
based,  as  are  the  others. 

A.  Abstract 


B.  Data  tape  documentation 

C.  Codebook 

D.  Frequency  Tables 

General  Overview: 

All  four  are  very  similar  in  that  they  ask  that  data  be 
sufficiently  identified  and  adequately  described,  but  they 
vary  in  levels  of  exactness.  Most  agree  on  the  salient 
points  of  what  is  considered  "minimally  acceptable", 
although  Geda  is  the  only  one  who  expands  on  the  idea  of 
establishing different "levels" of description by specifying
"minimal" standards. All give advice on data production
such as coding standards and the treatment of
missing data, as well as provide guidelines for archiving
data (this is sensible as the two are inextricably connected).
Most of the requirements are still quite valid.
The  checklist  that  Geda  provides  could  be  used  today, 
with  some  minor  modifications.  Roistacher  tends  to  be 
over-verbose,  and  appears  quite  inhibiting.  The  work 
involved  could  appear  too  burdensome,  and  it  may  scare 
people  off. 

II. Changes Needed

We  need  to  promote  proper  documentation  by  providing 
simple  and  straightforward  guidelines;  re-establish  what 
we  consider  "minimum"  standards;  offer  guidelines  for 
differing  levels  of  documentation;  and,  update  standards 
to  deal  with  new  media  and  new  data  formats.  At  the 
same  time  we  need  to  acknowledge  the  existence  of 
differing  individual  requirements  for  both  the  data 
themselves  as  well  as  the  organizations  that  store  them 
(ICPSR might theoretically have different requirements
than  a  local  archive  such  as  DPLS). 

New  media 

Looking  at  the  recent  discussions  on  various  listservers, 
and  according  to  my  own  experience,  there  doesn't  seem 
to  be  current  agreement  on  a  single  data  transmittal  and/or 
storage  format. 

New  formats 

Many  software  packages,  for  example,  SAS,  advertise 
themselves  as  data  management  tools.  Indeed,  large  data 
collection  efforts  such  as  the  National  Medical  Expendi- 
ture Survey  (NMES)  use  SAS  macros  to  generate  their 
codebooks,  frequencies,  clean  data  and  produce  final 
public  use  data  files.  We  see  data  in  many  other  formats 
such  as  DBASE,  Lotus,  and  SPSS  export  files. 

New  data  collection  techniques 
The Computer Assisted Interview products, often called
CAPI (Computer Assisted Personal Interview) and CATI
(Computer Assisted Telephone Interview), have notoriously
bad "back-ends". The documentation they produce
is barely human-readable, if at all. Obviously the companies
have been spending their efforts on developing the
front-ends of the software.


'New' archival environment

Many of us are dealing with an ever-expanding universe
of diverse data types. We are seeing for the first time
image data, sound data, and text data. We need to strive
to insure that any recommendations we make are applicable
to these less "traditional" forms of data as well as
"hard science" data.

III. The Need for Balance

Many of the changes described above obviously have to
do with technological advances. Changes in the way
data are stored (media), formatted (software) and collected
(CATI, CAPI) have had tremendous effects on
The Public Use Process. We need to be prepared to deal
with the fact that data are no longer primarily tape-based,
nor can we, I believe, force out-of-date standards on
others. Indeed, as of this time, there seems to be no real
consensus as to what is the "standard" or "perfect"
format. We must be flexible in what we accept, with the
understanding that this flexibility may put a burden NOT
only on us as archivists, but on our users who will have
to understand that the data they want to use may take
some effort on their part. We must examine the trade-off
between losing data and making data available that
are not perfectly documented, in a vanilla wrapper, or
that may even be of suspect quality. Jean Stratford
makes the argument that there may be value in archiving
data that are "poorly" done so that others may find that
out for themselves, and learn from the mistakes made.

We must acknowledge that much of The Public Use
Process is a balancing act:

Balancing act 1. Producers must balance the timeliness
of data release against quality of data and
documentation.

Example: The 1977 National Medical Expenditure
Survey files were absolutely "pristine" but the
preparation process took too long and the time lag
would not be acceptable now. A recent user survey
by the Department of Health and Human Services
concerning the National Medical Expenditure Survey
cites data currency and timeliness of release as a
major issue.⁶

Example: The current problem with the 1990 U.S.
Census of Population and Housing PUMS: what
happens when the pressure to release data quickly is
so severe that quality is jeopardized, and what are the
costs involved in releasing such data versus
withholding them?

Balancing act 2. Producers and archivists must balance
user friendliness against possible cost savings and the
increasing complexity of data and documentation.

Example:  The  National  Survey  of  Families  and 
Households  opted  to  produce  voluminous  paper 
documentation,  at  great  cost,  because  of  the  extreme 
complexity  of  the  data.  In  order  to  avoid  necessitating 
complex  linking  procedures  they  have  opted  to  create 
simple  to  use  flat  files  that  store  data  inefficiently  but 
make  users'  lives  much  easier — but  again  at  great 
cost. These were conscious decisions made on the
basis  of  experience  and  empirical  testing.  (For 
example,  they  tried  incorporating  the  thousands  of 
skip  patterns  into  the  codebook  and  it  looked  terrible, 
so  they  removed  them  and  created  a  separate  section). 

Example:  The  recent  survey  of  users  done  by  the  U.S. 
Department  of  Labor  for  the  Consumer  Expenditure 
Surveys  cites  that  a  significant  number  of  users  had 
difficulty  linking  the  data  for  a  consumer  unit  across 
all  the  data  files,  and  others  complained  they  had 
problems  manipulating  the  quarterly  data  files  to 
derive  annual  estimates.  These  responses  led  the 
department  to  consider  the  creation  of  annual  files,  as 
well  as  reorganization  of  data  for  household  level 
analysis.⁷

Example:  The  documentation  for  Charging  and 
Sentencing  of  Murder  and  Voluntary  Manslaughter 
Cases  in  Georgia  1973-1979.  This  is  one  of  the  worst 
examples  of  data  documentation  that  I  have  ever  seen. 
The  documentation  is  awful!  There  is  no  codebook  as 
such,  just  confusing  SAS  procedures.  The  text 
sections  are  indecipherable.  Terrible  variable  names 
are  used. 

Example:  The  data,  documentation,  and  User's 
Guides  produced  by  the  Sociometrics  Corporation  are 
at the other end of the spectrum. They reflect the
quality that can be achieved when the aim is for
maximum  user  friendliness  of  data  and 
documentation.  The  documents  accompanying  their 
studies are the closest thing to archival works of art I
have  seen.  Narrative  sections  would  seem  to  have 
been  based  upon  standards  set  in  the  early  manuals, 
but  extraordinary  effort  is  made  to  document 
individual  variables  to  promote  ease  of  use.  Effort  is 
made  to  further  facilitate  use  of  the  data  by  offering 
numerous  media  options  and  retrieval  software. 
While there are producers I have spoken with who
question the huge amount of effort that goes into
Sociometrics' processing, particularly the naming of
all variables by subject topic and type, there is little
doubt in my mind that there are few user complaints
(except, perhaps, concerning cost).

Balancing act 3. Archivists must balance their 'needs'
against what is reasonable and realistic to ask of the P.I.


There  is  a  need  to  develop  a  policy  of  'minimal'  interven- 
tion and  'minimum'  requirements.  How  do  we  minimize 
data  loss  while  maintaining  an  adequate  degree  of 
standardization? Do we need to consider how these data
will  be  used?  If  they  will  be  used?  And,  who  will  use 
them? 

IV.  How  to  make  it  work? 

It  is  clear,  as  Sue  Dodd  stated  in  her  call  for  contributions 
to this session, that we must target data creators, producers
and  agencies.  We  must  also  target  ourselves  because  we 
have  important  decisions  to  make  concerning  our  priori- 
ties. We  must  balance  our  needs  against  those  of  data 
producers  and  users,  and  we  must  come  to  a  consensus 
among ourselves as to what constitutes The Public Use
Process  and  how  we  can  best  promote  it. 

This  obviously  will  involve  revising  the  old  manuals,  and 
the  creation  of  simple  guidelines  that  take  into  account 
not  only  those  old  standards  that  are  still  valid,  but  also 
recent  complications  caused  by  new  media  and  new  data 
formats.  This  re-evaluation  process  should  involve  data 
producers  and  users,  as  well  as  a  diverse  group  of  archi- 
vists and  data  librarians  drawn  from  local  archives  such 
as  DPLS,  higher-level  archives  such  as  the  ICPSR,  ESRC 
and Steinmetz, and "corporate" archives such as
Sociometrics. Perhaps traditional archivists should be involved
also  for  the  special  points  of  view  they  may  have  to  offer. 

In addition, organizations such as IASSIST, IFDO and
APDU can lobby and work with granting agencies,
research organizations, and the companies that produce
CAPI and CATI products.

In  this  paper  I  did  not  intend  to  come  up  with  a  new  set  of 
guidelines  to  replace  the  old  ones.  I'm  not  even  sure  that 
the  old  ones  are  so  out-of-date.  As  we  have  seen,  some 
revisions  are  necessary,  but  they  will  not  be  difficult  to 
make.  While  technological  issues  are  involved  in  these 
changes  (package  dependency  of  machine-readable 
documentation, image files, CATI and CAPI), the problem
is primarily a "people" one. The task at hand is to
succeed  where  in  the  past  we  have  not:  getting  our 
message  across  to  data  creators,  producers  and  the 
granting  agencies  to  ensure  that  quality  data  and  docu- 
mentation are  produced  and  archived. 

1. Presented at the IASSIST/IFDO 93 Conference held in
Edinburgh,  Scotland.  May  1993. 


2. I would like to thank Vaughn Call, Patrick Collins, Eric
Lang, Ivy Piliavin, Debbie Potter, Jane Powers and Jean
Stratford for their substantial contributions to this paper.

3.  Robbin,  Alice.  Three  Reasons  for  the  Underutilization 
of  Social  Science  Data  Services  in  the  Information  Age. 
September,  1981. 

4.  S.  2813.  Introduced  during  the  102nd  Congress;  2nd 
Session  by  Senators  Gore,  Ford,  Sarbanes  and  Simon.  "A 
Bill  to  establish  in  the  Government  Printing  Office  an 
electronic  gateway  to  provide  public  access  to  a  wide 
range  of  Federal  databases  containing  public  information 
stored  electronically". 

5. Most of the information used in this section was taken
from the directives outlined in the U.S. Department of
Commerce Statistical Policy Handbook, published by the
Office of Federal Statistical Policy and Standards in
May,  1978. 

6.  U.S.  Department  of  Health  and  Human  Services. 
Office  of  the  Inspector  General.  The  National  Medical 
Expenditure  Survey.  February,  19983.  p.  ii. 

7.  U.S.  Department  of  Labor.  Bureau  of  Labor  Statistics. 
Consumer  Expenditure  Data  User  Survey  Results 
(Report  832).  March,  1993.  p.7. 


1993 IASSIST CONFERENCE

Edinburgh, Scotland
May 11-14, 1993


Gaétan Drolet
Tél.: (418) 656-7970
Fax: (418) 656-3048
GDROLET@CAMPUS.ULAVAL.CA
Bibliothèque
UNIVERSITÉ LAVAL
QUÉBEC




STARTING  A  DATA  LIBRARY  SERVICE: 

WHERE  ARE  THE  USERS? 


1 Background

1.1 External factors

1.2 Internal factors


2 Why conduct a survey?

2.1 Who are the users?

2.2 Where are the users?

2.3 How do they work with data?


3 Methodology

3.1 Population

3.2 Questionnaire


4 Highlights of the survey of users' needs

4.1 Use of data by researchers

4.2 Sources of data

4.3 Technical support

4.4 Financing

4.5 Personal data


5 After the survey

5.1 Feedback to the respondents

5.2 Report to the library administration

5.3 Proposal to the Computing Committee


6 Conclusion




1    Background 

Before discussing specific aspects of the survey carried out in my institution, I would
like  to  present  the  general  context  in  which  Quebec  universities  found 
themselves  with  respect  to  computer  files  at  the  time  of  the  survey. 


1.1  External  factors 

In  1989,  CARL  (Canadian  Association  of  Research  Libraries)  created  a  consortium  to 
negotiate  a  collective  acquisition  of  the  Canadian  census  on  tapes  for  universities. 

At  that  time,  27  Canadian  institutions  acquired  the  census.  For  20  of  us,  it  was  the  first 
acquisition of data files. At the time, only 4 or 5 universities had an organized data library
service. 

For  the  Quebec  universities  (francophone  part  of  Canada),  the  CARL  Consortium  was  like 
a  "Pandora's  box".  Our  libraries  didn't  really  know  how  to  react.  To  use  a  popular 
expression,  the  Quebec  universities  and  some  other  Canadian  universities  were  caught 
"between  a  rock  and  a  hard  place". 

When  we  received  the  tapes,  we  really  did  not  know  what  to  do  with  them.  For  the 
majority  of  universities,  the  first  operation  consisted  of  sending  the  tapes  to  the  computing 
service.  From  that  time  onward  and  thru  the  listservs  on  the  Internet,  the  library  and  its 
librarians  have  discovered  they  will  have  to  play  an  active  role  in  using  computer  files  in 
the  future. 

As  a  group,  the  Quebec  university  libraries  decided  to  work  together  and  we  created  a 
small working group on numeric data files. This working group responds to a university
library directors committee of the Conference of rectors and principals of Quebec
universities. The working group on numeric data files has been in operation since 1990.
To  date  our  main  activity  has  been  to  organize  training  sessions  to  familiarize  librarians 
with  data. 

On  the  occasion  of  the  first  workshop,  we  proposed  an  activity  for  the  library  directors 
themselves.  The  day  preceding  the  workshop,  our  main  spokesperson  (a  well  known  data 
librarian) agreed to meet with them. In the space of a few hours, library directors were
briefed  on  the  administrative  aspects  of  a  data  library  and  the  impact  on  library  activities. 
Once  they  knew  what  was  involved,  they  realized  they  had  no  other  options.  Their 
libraries  must  offer  access  to  data  files.  They  know  it  will  not  be  so  fast! 

Since  that  time,  each  library  director  has  selected  a  "volunteer"  in  his  organization,  a 
person  whose  role  it  is  to  become  a  data  librarian.  Last  month  in  Quebec  City,  we  held  a 
one-day meeting with these future data librarians from seven universities. On this
occasion,  we  discussed  the  development  of  the  service  in  each  of  our  institutions. 

These are the factors from the external context.




1.2  Internal  factors 

In  my  own  institution,  internal  factors  such  as  acquisition  of  data  files  by  individual 
researchers  or  demands  directed  to  the  library  for  data  acquisition  convinced  us  to  examine 
the  situation  in  greater  detail.  We  discovered  at  that  time  that  ICPSR  data  was  in  demand 
to the point that some of our social science researchers were going to the ICPSR site on
sabbatical  leave  to  have  access  to  ICPSR  data. 

Another  consideration  which  had  an  important  impact  in  the  slow  starting-up  period  of  our 
data  library  is  that  the  library  has  to  implement  an  on-line  catalogue  called  ARIANE.  In 
the  past  two  years,  the  focus  of  our  activity  has  been  geared  towards  this  objective.  Almost 
all  other  projects  were  put  on  "hold".  Sometimes  ARIANE  is  my  enemy,  other  days  it  is  a 
friend. An enemy, because it has temporarily stopped the implementation of a data library
service  and  a  friend,  because  it  has  brought  microcomputers  and  a  team  of  computer 
experts,  dedicated  to  improving  library  services.  We  are  now  moving  toward  another 
phase:  life  after  the  on-line  catalogue. 

Another  important  internal  factor  to  be  considered  here  is  the  computing  environment  in 
my  institution.  It  is  still  a  highly  centralized  environment.  A  large  group  of  computer 
experts (in my university; I don't know very many of them in other universities) wish to
remain  the  masters  of  the  mainframe.  In  fact,  they  don't  want  outsiders  such  as  myself  to 
use  their  keys.  It  is  not  easy  to  bring  the  data  needs  of  the  users  to  their  attention.  They  seem 
to  be  afraid  of  the  intrusion  of  researchers.  Some  of  them  seem  to  be  more  machine  oriented 
than  user-services  or  needs-oriented. 

I have to add that in my own institution, until two years ago, when OPAC was created, there
were no formal links established between the library and the computing service, and no
tradition of collaboration existed. It has been a difficult task to convince them to cooperate
with  us  in  this  venture. 

2 Why conduct a survey on users' needs?

As a librarian I was interested in trying to define the role of my university library with respect
to  computer  files.  I  also  tried  to  convince  some  of  my  colleagues  (subject  librarians).  At  the 
outset,  nobody  wanted  to  venture  outside  the  well-known  bibliographical  field. 

At that time, because the library had no formal links with the computing service, I decided
on my own to conduct a survey of researchers interested in using computer files. At the
same time, I began to establish relationships with the computer experts. The main objective
of  the  survey  was  to  determine  the  needs  of  professors  and  researchers  with  respect  to  data 
files  and  to  get  some  basic  information  which  the  library  could  use  to  begin  formal 
discussions  with  the  computer  center.  You  should  know  that  in  terms  of  service  it  was  at 
level  0  on  the  campus.  The  library  knew  nothing  about  data,  and  had  no  tradition  with 
computer  files. 

First,  we  needed  to  identify  the  users.  As  there  was  no  data  library  service  on  the  campus, 
we knew only that there were probably a few researchers using computer files somewhere
on  the  campus. 




Second,  we  needed  to  know:  where  are  they  located?  How  many  are  there?  In  which  fields 
or disciplines? As the social science librarian, I knew that some political scientists,
economists  and  sociologists  and  some  others  on  campus  were  using  data. 

Third,  we  needed  to  know:  How  do  they  work  with  data?  How  do  they  use  the  research 
data?  What  are  the  main  difficulties  they  encounter?  Which  data  sources  do  they  use?  Do 
they  have  any  problems  in  obtaining  their  research  information?  What  do  they  do  with 
their  own  data?    Do  they  have  access  to  assistance  or  technical  support? 

3  Methodology 

3.1  Population 

As we didn't know who the users of computer files were, how many there were, or in which
departments, faculties or research centers they were located, we decided to mail the
questionnaire to every professor on the campus (N = 1450). We also published an
announcement in the campus newspaper to invite other respondents (graduate students
using data files) to participate in the survey.

We were expecting to receive between 40 and 50 responses. Much to our surprise, nearly 140
people spread throughout 15 schools and faculties returned the questionnaire. This
operation allowed us to identify the real clientele using data files at Laval and also to
gauge the range of potential future users.

The response rate of 9.5% (N = 140) may seem very low. Part of the explanation is that,
among all the professors at Laval, only a few specialize in quantitative
methods. We believe that the core group of users of data files for teaching and research can
be found among them.
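
The arithmetic behind this response rate can be checked with a short Python sketch. The
function name response_rate is only illustrative. The text gives 1450 questionnaires mailed
and "nearly 140" returned; the count of 138 used below is an assumption, chosen because it
reproduces the reported 9.5%, whereas the rounded figure of 140 gives about 9.7%. The sketch
also ignores the graduate students invited through the campus newspaper, since the text
does not say how many of them replied.

```python
def response_rate(returned: int, mailed: int) -> float:
    """Return the response rate as a percentage of questionnaires mailed."""
    if mailed <= 0:
        raise ValueError("mailed must be positive")
    return 100.0 * returned / mailed


MAILED = 1450  # questionnaires mailed to professors (figure from the text)

# 140 is the rounded count given in the text; 138 is an ASSUMED exact count,
# used here only because it matches the reported rate of 9.5%.
for returned in (138, 140):
    print(returned, round(response_rate(returned, MAILED), 1))
# prints:
# 138 9.5
# 140 9.7
```

Either way, roughly one professor in ten replied, which is the order of magnitude the
argument relies on.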

Using the same questionnaire, the survey was replicated at another university
(University of Quebec at Montreal). The response rate in Montreal was similar (10.5%).

3.2  Questionnaire 

Before constructing the questionnaire, as a good librarian, I ran a literature search to
become familiar with the content and the terminology involved. This exercise led me to
articles from the IASSIST Quarterly. In this way I discovered not only this association
but also the articles of data librarians such as Judith Rowe, Sue Dodd, Laine Ruus, Ray
Jones, Chuck Humphrey, etc.

The  survey  was  conducted  by  sending  out  a  questionnaire.  The  questionnaire  was 
designed  around  five  groups  of  questions: 

-Use  of  research  data 
-Sources  of  data 
-Technical  support 
-Questions  related  to  financing 
-Personal  data 


You will note that the questionnaire was written in French, the official language of the
university. I can provide a copy of the questionnaire to those who are interested.

After  completion  of  the  first  version  of  the  questionnaire,  we  tested  it  with  some  colleagues 
(librarians)  and  with  three  professors  from  three  different  faculties. 


We  announced  the  forthcoming  questionnaire  to  professors  through  a  letter  sent  to 
directors  of  departments  in  the  humanities  and  social  sciences  sectors.  In  the  social 
sciences  for  example,  many  directors  requested  that  their  professors  reply  to  the  survey. 

No  other  reminders  were  sent  out.  The  data  collected  was  analyzed  with  SAS. 

4     Highlights 

Some  results  of  the  survey  of  users'  needs. 

4.1  Use  of  data  by  researchers 

The  first  group  of  questions  was  related  to  the  use  of  data  files  by  researchers. 

-First  we  asked  them  if  they  used  data  files  regularly,  occasionally,  rarely  or  never  (but 
interested  in  future  usage). 


Figure: MRDF usage (never 39%, occasionally 19%, regular 35%).




Among the users (59% of respondents), we also tried to find out which types of data they
used (numeric data files, textual data files, numeric and textual, or any other types of data).


Figure: Types of data (categories: numeric, textual, others; values legible in the source:
64% and 21%).

Another  question  dealt  with  frequency  of  access  to  data.  Do  they  access  their  files  weekly, 
monthly  or  once  a  year? 


Figure: Frequency of data use (weekly 45%, irregular 42%; a monthly segment is also
shown, but its value is not legible in the source).



-  Among the users of data files, the large majority (85%) create a subset for analysis.
   This may be no surprise to those of you with experience in data libraries, but it was
   a revelation for me, a newcomer at that time to data management.

-  77% of them regularly use data to write journal articles, books or book chapters, or
   to prepare reports.

-  In the three years preceding the survey, 66% of respondents had supervised master's
   theses, and 37% Ph.D. theses, by candidates using electronic data (numeric or
   textual).

-  In order of importance, the main reasons limiting access to and use of
   computer files in my institution at the time of the survey were:

   The most important reason:   lack of money to buy the files;
   Second reason:               insufficient training for using the data;
   Third reason:                difficulties of gaining access to data.

-  The other reasons mentioned were lack of money for the analysis, no retrieval
   software available, insufficient support for statistical and computer analysis, and
   difficulties in retrieving the data.

All these reasons proved to me that there is a dramatic loss of expertise in the institution.
Each researcher has to work out his or her own procedure. The difficulties of gaining access
to data lead researchers to concentrate more on the tools or on the procedure for
getting the data than on the data themselves and their analysis.

4.2  Sources  of  data 

The  second  group  of  questions  was  related  to  the  sources  of  data.  Where  does  the  data 
come  from? 

-  In the two years preceding the survey, almost half of the respondents (43%)
   individually contacted Statistics Canada for electronic data. Statistics Canada is the
   main producer of data in Canada. We discovered that different groups of researchers
   had bought the same files twice or more. For example, there are 5 copies of a survey
   called Sante Quebec (Health Quebec) held by researchers on the campus.

-  At that time the researchers learned of the existence of data files mainly through (in
   order of importance):

   Colleagues          38.5%
   Journal articles    29.2%
   Data producers      21.5%
   Other ways          10.8%



-  Consulting a list or a directory came last.


-  Humanists and social scientists informed us that they need data from ICPSR. Our
   institution is not yet an ICPSR member. I know that one of our departments
   recently lost a well-known professor because we were not an ICPSR site.

-  Asked about their most important sources of data, the respondents named
   the following:


   Governmental organizations                  49.3%
   Files created locally by the researchers    34.3%
   Commercial firms                            14.9%


4.3  Technical support

The third group of questions was related to technical support. We made a distinction
between computer assistance and statistical support.

-  Support for computer problems. From whom do you obtain assistance for computer
   problems? From your computing service? From colleagues? Or do you solve your
   problems yourself? Only 35% get support from the computer center.


Figure: Computer assistance (colleagues 49%, computer centre 35%, self-help 16%).




-  Support  for  statistical  analysis.  We  asked  the  same  kind  of  question  for  statistical 
help.  For  this  aspect,  the  technical  support  comes  from  many  different  sources. 


Figure: Statistical support, by source (labels legible in the source: colleagues, others,
computer centre; values legible: 45%, 38% and 3%, with one further value garbled).


-  The statistical package most often used is SAS (55%), followed by SPSS (35%). SAS is
   supported by the computing center. About 10 other packages were cited (Lotus,
   StatView, Excel, Systat, RATS), mostly running on microcomputers.

-  We also asked them: once you have used the data, where do you keep the original
   data?

   -  at the computer center: 42%

   -  the others: at their office, department, laboratory and even at home.

4.4  Financing

Two questions were asked regarding funding.

-  Which source of funding allows you to buy the files you use?

   Research grants              62%
   Department/faculty budget    17%
   Free data                    10.5%
   Other sources                9%
   Library budget               1.5%



-  The majority of respondents (58%) said that they were ready to contribute financially,
   from their faculty budgets or from their grants, to the acquisition of computer
   files which would be deposited in a data library facility. They proposed some other
   means of financing, such as using a percentage of their grants, charging user fees, or
   encouraging the participation of research units.


4.5  Personal  data 

Some questions were related to their own data: new data created by the researchers
themselves, not the public data they buy.

-  Archiving  of  data  created  by  the  users. 

We  asked  them:  where  do  you  keep  the  data  you  created?  Do  you  deposit  the  data  in 
the  computer  service,  in  your  department  or  do  you  keep  it  stored  in  your  office? 


Figure: Data archiving (office 52%, research centres 17%, many places 12%; a computer
centre segment is also shown, but its value is not legible in the source).

A similar archiving pattern emerges for privately held data as for the public data. We can
only conclude that the office is not a very good storage location.

66% of the respondents reported they were prepared to deposit their own files in a data
library service. Among them, 52% were willing to write a codebook to facilitate access to
their files. They probably do not realize how much work that involves.




At the end of the questionnaire, we asked the potential users to suggest which computer
files the institution should buy. These suggestions will become part of the basic collection
which we have begun to develop.

5  After  the  survey 

I  would  like  to  recount  some  of  the  steps  which  have  been  taken  following  the  survey.  The 
information  collected  by  the  survey  has  facilitated  the  decisions  and  made  these  steps 
possible. 

5.1  Feedback  to  the  respondents 

First,  to  thank  the  respondents,  we  sent  them  a  document  presenting  the  main  results.  I 
have  some  copies  with  me,  in  French. 

5.2  Report  to  the  library  administration 

Secondly, we used the information gathered in the survey to prepare a report to the library
administrators. This report was submitted in December 1991 and recommended the
creation of a data library service on the campus. The report was discussed and partly
accepted.

The first action taken by the library administration was to designate a librarian as a
part-time data library coordinator (about two days a week) and to encourage the training of
this person. Last year I had the privilege of being selected for the one-week ICPSR training.
It is an immersion which I recommend to any newcomer. The course is given by the "three
data amigos". Two of them are present at this conference.

Another action which was taken is related to the acquisition of data files. Because the most
important reason limiting access to data seemed to be lack of money to buy the files,
the report recommended the creation of a distinct library acquisitions budget for computer
files. We obtained this concession. Even if we are not yet fully operational, we have begun
to develop a basic collection. First, we began with the core suggestions made by the
respondents to our survey. Secondly, we regularly receive written proposals from the
researchers.

With  one  faculty,  we  began  to  develop  a  shared  acquisition  policy  for  very  specialized 
data.  Both  the  library  and  the  faculty  pay  for  the  data.  We  also  asked  the  subject  librarian 
in  the  field  to  participate  in  the  negotiation  process.  She  can  thus  become  familiar  with  data 
in  the  discipline  of  her  clientele.  It  is  a  kind  of  training  for  her. 

With regard to the acquisition of files, the next step will be to join ICPSR.

The report recommending the creation of a data library service was judged important
enough to be included in the overall objectives of the library. The data library project is now
part of the official five-year plan (1992-1997) for our library.

We  are  now  slowly  moving  toward  the  creation  of  a  data  library  service  without  any 
additional  human  resources. 




5.3  Proposal  to  the  Computer  Committee 

A  few  months  after  the  presentation  of  the  report  to  the  library  administration,  we 
undertook  a  preliminary  analysis  with  a  member  of  the  computer  centre.  In  this  analysis 
we  tried  to  define: 

-the  sharing  of  responsibilities  between  the  library  and  the  computer  service; 

-the  definition  of  a  user-friendly  system  which  we  would  like  to  acquire  or  to  develop. 

On this basis, a formal proposal was presented to the computer priorities committee of the
university for 1992-1993. This is where the computer "pie" is shared. The library and the
computer service were authorized to pursue their discussions in this direction. As a result
of the analysis, a report proposing a few suggestions for development was tabled and
given to the library administration last month. A decision will be made on these
suggestions. We propose to develop a local system (using the software Oracle) based on
a UNIX workstation. It will be a small system at the beginning, with possibilities
for expansion.

Meanwhile, at the working level, I have also tried to convince some of the computer
experts of the importance of data for research at the university. It has taken me two and
a half years to obtain a resource person from the user service. This person is presently only
allowed to make copies of tapes. We expect to be able to make better progress as time goes
on. The timid cooperation of the user service in the computer centre has forced the library to
consider using graduate students familiar with computers and statistics to accomplish
basic tasks related to management of the files and eventually to help with data extraction.

Conclusion 

The information gathered by the survey has helped me to inform the library administration
and the Computer Centre about the role we will have to play in regard to data on our
campus. Researchers on campus are beginning to realize that the library is going in the same
direction as they are. Some of them have already forged ahead. Before developing the
system, we will consult them again to find out how best to serve their needs with respect
to data access.




IASSIST

INTERNATIONAL ASSOCIATION FOR SOCIAL SCIENCE INFORMATION
SERVICE AND TECHNOLOGY

ASSOCIATION INTERNATIONALE POUR LES SERVICES ET TECHNIQUES
D'INFORMATION EN SCIENCES SOCIALES


Membership form


The International Association for Social Science Information Services and
Technology (IASSIST) is an international association of individuals who
are engaged in the acquisition, processing, maintenance, and distribution of
machine readable text and/or numeric social science data. The membership
includes information system specialists, data base librarians or administrators,
archivists, researchers, programmers, and managers. Their range of
interests encompasses hard copy as well as machine readable data.

Paid-up members enjoy voting rights and receive the IASSIST QUARTERLY.
They also benefit from reduced fees for attendance at regional
and international conferences sponsored by IASSIST.

Membership fees are:

Regular Membership: $40.00 per calendar year.

Student Membership: $20.00 per calendar year.

Institutional subscriptions to the Quarterly are available, but do not confer
voting rights or other membership benefits.

Institutional Subscription: $70.00 per calendar year (includes
one volume of the Quarterly)


I would like to become a member of IASSIST. Please see my choice below:

□  $40 Regular Membership

□  $20 Student Membership

□  $70 Institutional Membership

My primary interests are:

□  Archive Services/Administration

□  Data Processing

□  Data Management

□  Research Applications

□  Other (specify)


Please make checks payable to IASSIST and mail to:

Mr. Marty Pawlocki
Treasurer, IASSIST
% 303 GSLIS Building, Social Science Data Archives,
University of California, 405 Hilgard Avenue,
Los Angeles, CA 90024-1484


Name / title

Institutional Affiliation

Mailing Address

City

Country / zip / postal code / phone



